From 172ad7cd987d75655e9dd89ee4a1333a60de72f6 Mon Sep 17 00:00:00 2001 From: Max Murshed Date: Thu, 11 Jun 2026 21:49:27 -0700 Subject: [PATCH] perf: quotient-sized division for short-quotient shapes (38-75x on 2^k+1 pathology) MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit New algorithms/division/QuotientSizedDivision.h: for ratio < 4/3 at large b, the (delta+1)-limb quotient is determined to +-a few units by the operand tops. With t = delta+4, divide (a >> B^(nb-t)) by (b >> B^(nb-t)) — truncating b perturbs q by <= ~B^(2-GUARD), sub-ulp — then reconstruct the exact remainder with one delta x nb back-multiply plus +-few fixups (cap 8, fallback to full Newton). Cost scales with the QUOTIENT, not the divisor. Dispatch: b >= NEWTON_BALANCED_B, a >= b + QSIZED_MIN_DELTA (64), ratio < 4/3. The tops divide (ratio ~2) goes to Newton directly at t >= 6144 — post-#85-87 Newton beats BZ at ratio 2 from ~6k limbs (13 vs 22 ms at 12k, 40 vs 299 ms at 30k, 0.16 s vs 5.3 s at 65537). Recursion cannot re-enter the band (tops ratio ~2 > 4/3). Also lowers NEWTON_BALANCED_B 98304 -> 24576: the generic ratio-4/3 Newton/BZ crossover sits at ~24k limbs after the wraparound PRs. Measured (M1 Max, min of 3): - nb=131073 (2^17+1) ratios 1.05/1.10/1.25: 1.07 s / 2.16 s / 5.35 s -> 28 / 43 / 71 ms (38-75x) - nb=120000 same ratios: 37/56/116 -> 30/39/66 ms - nb=48000 ratio 1.25: 16 ms vs Newton 39, FastDivision 733 - nb=131072 (BZ pow2 best case) 1.25: 65 vs 88 ms - no regression band This closes the last known residual in the division dispatch (ratio in (1, 4/3) at large b routing to BZ's transform-doubling blowup). Tests: 246 unit tests, div_correctness, identity + FastDivision cross-checks across the band (incl. all-max/sparse patterns, 2^k+1 sizes, band edges 98304->24576), Pow10 suite, ToString round-trips. Co-Authored-By: Claude Fable 5 --- CLAUDE.md | 5 +- docs/DIVISION.md | 6 +- include/biginteger/algorithms/Division.h | 24 +++-- .../division/QuotientSizedDivision.h | 102 ++++++++++++++++++ src/algorithms/Division.cpp | 14 +++ 5 files changed, 140 insertions(+), 11 deletions(-) create mode 100644 include/biginteger/algorithms/division/QuotientSizedDivision.h diff --git a/CLAUDE.md b/CLAUDE.md index 153f975..2b774b3 100644 --- a/CLAUDE.md +++ b/CLAUDE.md @@ -43,14 +43,15 @@ CI (`.github/workflows/qa.yaml`) runs `nanoclaw-task qa-agent` on PRs against `o **Division dispatch** (`algorithms/Division.h`, thresholds defined in `src/algorithms/Division.cpp`): - `NewtonDivision` (Newton-Raphson reciprocal, O(M(n)); handles arbitrary `na/nb` via blockwise mode — top chunk in [n+1, 2n], slide down by n, thread the remainder) when any skew band holds: - `b ≥ 4096` at ratio ≥ 3 (`NEWTON_SKEW` 3/1), or - - `b ≥ 98304` at ratio ≥ 4/3 (`NEWTON_BALANCED` 4/3 — the near-balanced band, PR #79, lowered from 2/1 2026-06-11), or + - `b ≥ 24576` at ratio ≥ 4/3 (`NEWTON_BALANCED` — near-balanced band; ratio lowered from 2/1 and floor from 98304 on 2026-06-11), or - `b ≥ 2048` at ratio ≥ 8 (`NEWTON_HIGH_SKEW` 8/1). +- `QuotientSizedDivision` when `b ≥ 24576`, `a ≥ b + 64`, and ratio < 4/3: divides the operand TOPS (`t = Δ+4` limbs of b, `Δ+t` of a) for the (Δ+1)-limb quotient, then one Δ×nb back-multiply for the remainder — cost scales with the quotient, not the divisor. - else `BurnikelZieglerDivision` for power-of-two base when `b > 512` and the BZ band fits (near-balanced `b ≥ 1024, b+32 ≤ a ≤ 3b`, or big-and-skewed `a > 2048 && a > 3b`). - otherwise multi-limb → `FastDivision` (Knuth Algorithm D variant) - single-limb divisor → `ClassicDivision` - `KnuthDivision` and `ReciprocalDivision` are alternates used by correctness tests for cross-checking. -The balanced band exists because BZ's recursive 2n/n halving lands intermediate NTT multiplies just over power-of-2 transform-length boundaries for non-power-of-2 divisor sizes, blowing up 5–60× vs Newton (worst at `n = 2^k+1`); Newton pads once and stays flat. **Known residual:** ratio ∈ (1, 4/3) at large `b` still routes to BZ; generic sizes are fine there (BZ wins below the ~1.3 crossover) but `2^k+1`-family divisor sizes still blow up — the proper fix for that range is quotient-sized division. +The balanced band exists because BZ's recursive 2n/n halving lands intermediate NTT multiplies just over power-of-2 transform-length boundaries for non-power-of-2 divisor sizes, blowing up 5–60× vs Newton (worst at `n = 2^k+1`); Newton pads once and stays flat. The former ratio ∈ (1, 4/3) residual is closed by `QuotientSizedDivision` (2026-06-11): the `2^k+1`-family blowup shapes went from seconds to tens of ms, and it beats BZ even at BZ's exact-power-of-2 best case. When adding a new algorithm, slot the implementation under `algorithms//.h`, then update the dispatch in `algorithms/.h` — the thresholds there are the only place size cutoffs live. diff --git a/docs/DIVISION.md b/docs/DIVISION.md index dab61d6..afec09e 100644 --- a/docs/DIVISION.md +++ b/docs/DIVISION.md @@ -62,7 +62,7 @@ flowchart TD C -- no --> D{Compare(a, b)} D -- a == b --> Eq[return {1, 0}] D -- a < b --> Less[return {0, a}] - D -- a > b --> E{Newton band?
b ≥ 4096 and a ≥ 3b
OR b ≥ 98304 and 3a ≥ 4b
OR b ≥ 2048 and a ≥ 8b} + D -- a > b --> E{Newton band?
b ≥ 4096 and a ≥ 3b
OR b ≥ 24576 and 3a ≥ 4b
OR b ≥ 2048 and a ≥ 8b} E -- yes --> N[NewtonDivision] E -- no --> F{Power-of-two base
AND b.size > 512
AND BZ shape fits?} F -- yes --> BZ[BurnikelZieglerDivision] @@ -105,7 +105,9 @@ The current dispatch logic, paraphrased: The ordering matters: Newton wins on **large skewed** problems because the per-divisor reciprocal setup amortizes over multiple chunks. BZ wins on **mid-size near-balanced** problems where its 2n/n recursion structure beats both FastDivision and Newton's setup cost. -**Near-balanced band (PR #79, lowered to 4/3 on 2026-06-11).** Above `NEWTON_BALANCED_B` (98304 limbs), ratio-≥-4/3 division goes to Newton instead of BZ. BZ's recursive 2n/n halving lands its intermediate NTT multiplies just over power-of-2 transform-length boundaries for non-power-of-2 divisor sizes — the FFT length doubles and the constant factor compounds across recursion depth into a **5–60× slowdown vs Newton**, worst at `n = 2^k + 1` (measured ~90 s for a 262145-limb divisor vs Newton's ~0.9 s). Newton pads once to the working size and stays flat. Exact-power-of-2 divisor sizes are BZ's best case (it ties Newton); they regress ~4 % under this band but are rare in practice. **Band lower edge re-measured 2026-06-11** after the wraparound-Newton PRs (#85–#87) cut Newton's constant: the generic BZ/Newton crossover moved from ~2 down to ~1.3 (nb 100k–200k limbs: Newton wins from ratio 1.3–1.35 up; at 1.5 BZ is 1.7–1.9× slower, at 1.95 up to 3.4×). Band lowered `2/1 → 4/3`. Exact-power-of-2 divisors (BZ's best case) regress ≤ ~25% in the narrow (4/3, 1.4) sliver — same accepted tradeoff as PR #79. **Known residual:** ratio ∈ (1, 4/3) still routes to BZ; fine for generic sizes (BZ genuinely wins below the crossover) but `2^k+1`-family divisor sizes still hit the transform-doubling blowup (measured 1.1–5.4 s vs Newton's 0.16 s at nb = 131073, ratios 1.05–1.25). The right fix there is quotient-sized division (divide the top ~2Δ limbs by the divisor's top Δ limbs when the quotient is short), which scales with the quotient instead of the divisor. FastDivision is the default workhorse for everything else. +**Near-balanced band (PR #79; ratio lowered to 4/3 and floor to 24576 limbs on 2026-06-11).** Above `NEWTON_BALANCED_B` (24576 limbs), ratio-≥-4/3 division goes to Newton instead of BZ. BZ's recursive 2n/n halving lands its intermediate NTT multiplies just over power-of-2 transform-length boundaries for non-power-of-2 divisor sizes — the FFT length doubles and the constant factor compounds across recursion depth into a **5–60× slowdown vs Newton**, worst at `n = 2^k + 1` (measured ~90 s for a 262145-limb divisor vs Newton's ~0.9 s). Newton pads once to the working size and stays flat. Exact-power-of-2 divisor sizes are BZ's best case (it ties Newton); they regress ~4 % under this band but are rare in practice. **Band re-measured 2026-06-11** after the wraparound-Newton PRs (#85–#87) cut Newton's constant: the generic BZ/Newton crossover moved from ratio ~2 down to ~1.3 at large b, and the size floor from ~100k down to ~24k limbs (ratio-1.4 crossover ≈ 24k limbs; ratio-2 crossover ≈ 6k). Band lowered `2/1 → 4/3` and `NEWTON_BALANCED_B` `98304 → 24576`. The former ratio ∈ (1, 4/3) residual is closed by **QuotientSizedDivision** (see below). FastDivision is the default workhorse for everything else. + +**Short-quotient band (`QuotientSizedDivision`, 2026-06-11).** For `b ≥ NEWTON_BALANCED_B`, `a ≥ b + 64`, ratio < 4/3: with Δ = a.size − b.size and t = Δ+4, the (Δ+1)-limb quotient is determined to ±a few units by the operand TOPS — `q_est = floor((a >> B^(nb−t)) / (b >> B^(nb−t)))` (truncating b perturbs q by ≤ ~B^(2−GUARD), sub-ulp). The tops divide (ratio ~2, size ~2Δ/Δ) goes to Newton directly at t ≥ 6144 (measured: Newton beats BZ at ratio 2 from ~6k limbs — 13 vs 22 ms at 12k, 40 vs 299 ms at 30k, 0.16 s vs 5.3 s at 65537); the exact remainder then costs one Δ×nb back-multiply plus ±few fixups (cap 8, fallback Newton). Cost scales with the **quotient**, not the divisor. Measured (nb=131073, the 2^k+1 pathology): ratios 1.05/1.10/1.25 went from 1.07 s / 2.16 s / 5.35 s (BZ) to **28 / 43 / 71 ms** (38–75×). Generic nb=120000: 37/56/116 ms → 30/39/66 ms. It also beats BZ at BZ's exact-power-of-2 best case (131072 @1.25: 65 vs 88 ms) — no regression band. `KnuthDivision` and `ReciprocalDivision` exist as alternate implementations used by correctness tests for cross-checking. They are not in the production dispatch path. diff --git a/include/biginteger/algorithms/Division.h b/include/biginteger/algorithms/Division.h index 6942e9a..ce9a21c 100644 --- a/include/biginteger/algorithms/Division.h +++ b/include/biginteger/algorithms/Division.h @@ -7,14 +7,18 @@ * - b ≥ NEWTON_MEDIUM_B AND a ≥ NEWTON_SKEW (3/1) · b * - b ≥ NEWTON_BALANCED_B AND a ≥ NEWTON_BALANCED (4/3) · b * - b ≥ NEWTON_HIGH_SKEW_B AND a ≥ NEWTON_HIGH_SKEW (8/1) · b - * The balanced (ratio ≥ 2) band starts higher (96k limbs) because BZ wins - * near-balanced below that; above it BZ degrades erratically (measured - * 2×–4.5× slower at b ≥ 100k) while Newton stays smooth. - * 2. Power-of-two base AND b > BZ_DIVISOR_THRESHOLD AND BZ band fits + * The balanced (ratio ≥ 4/3) band starts at 24k limbs — the generic + * Newton/BZ crossover measured after the wraparound-Newton PRs (#85-#87); + * below it BZ wins near-balanced, above it BZ degrades erratically. + * 2. QuotientSizedDivision when b ≥ NEWTON_BALANCED_B, a ≥ b + 64, and + * ratio < 4/3 — short-quotient shapes where cost should scale with the + * quotient, not the divisor (and where BZ blows up 7-128× on + * 2^k+1-family divisor sizes). + * 3. Power-of-two base AND b > BZ_DIVISOR_THRESHOLD AND BZ band fits * → BurnikelZieglerDivision (balanced 2n/n recursion) - * 3. else + * 4. else * → FastDivision (Knuth Algorithm D, hybrid-64-bit basecase) - * 4. single-limb divisor inside the above → ClassicDivision + * 5. single-limb divisor inside the above → ClassicDivision * * Thresholds tunable at compile time via -DBIGMATH_*=N. * @@ -32,6 +36,7 @@ #include "division/ClassicDivision.h" #include "division/FastDivision.h" #include "division/NewtonDivision.h" +#include "division/QuotientSizedDivision.h" namespace BigMath { @@ -52,7 +57,7 @@ namespace BigMath #endif #ifndef BIGMATH_NEWTON_BALANCED_B -#define BIGMATH_NEWTON_BALANCED_B 98304 +#define BIGMATH_NEWTON_BALANCED_B 24576 #endif #ifndef BIGMATH_NEWTON_BALANCED_NUMERATOR @@ -63,6 +68,10 @@ namespace BigMath #define BIGMATH_NEWTON_BALANCED_DENOMINATOR 3 #endif +#ifndef BIGMATH_QSIZED_MIN_DELTA +#define BIGMATH_QSIZED_MIN_DELTA 64 +#endif + #ifndef BIGMATH_NEWTON_HIGH_SKEW_B #define BIGMATH_NEWTON_HIGH_SKEW_B 2048 #endif @@ -82,6 +91,7 @@ namespace BigMath extern const SizeT NEWTON_BALANCED_B; extern const SizeT NEWTON_BALANCED_NUMERATOR; extern const SizeT NEWTON_BALANCED_DENOMINATOR; + extern const SizeT QSIZED_MIN_DELTA; extern const SizeT NEWTON_HIGH_SKEW_B; extern const SizeT NEWTON_HIGH_SKEW_NUMERATOR; extern const SizeT NEWTON_HIGH_SKEW_DENOMINATOR; diff --git a/include/biginteger/algorithms/division/QuotientSizedDivision.h b/include/biginteger/algorithms/division/QuotientSizedDivision.h new file mode 100644 index 0000000..0f6e2e3 --- /dev/null +++ b/include/biginteger/algorithms/division/QuotientSizedDivision.h @@ -0,0 +1,102 @@ +#ifndef QUOTIENT_SIZED_DIVISION +#define QUOTIENT_SIZED_DIVISION + +#include +#include +using namespace std; + +#include "../../common/Comparator.h" +#include "../../common/Util.h" +#include "../Addition.h" +#include "../Multiplication.h" +#include "../Subtraction.h" +#include "NewtonDivision.h" + +namespace BigMath +{ + // Top-level dispatcher (defined in src/algorithms/Division.cpp). The tops + // sub-division below routes through it so the recursion picks the right + // strategy for its own size; its ratio is ~2, so it can never re-enter the + // quotient-sized band (ratio < 4/3). + pair, vector> DivideAndRemainder( + vector const &a, + vector const &b, + BaseT base, + bool computeRemainder); + + // Short-quotient division. When Δ = a.size() − b.size() is small relative + // to the divisor, the quotient has only Δ+1 limbs and is determined to ±a + // few units by the operands' TOPS: with t = Δ + GUARD, + // q_est = floor( (a >> B^(nb−t)) / (b >> B^(nb−t)) ) + // truncating b low limbs perturbs a/b by a relative ~B^(1−t), i.e. the + // (Δ+1)-limb quotient by ≤ ~B^(Δ+2−t) = B^(2−GUARD) — sub-ulp. The exact + // remainder then comes from ONE Δ×nb back-multiply plus ±few fixups. + // + // Cost: T(2Δ/Δ tops divide) + M(Δ, nb) + O(nb) — scales with the QUOTIENT, + // not the divisor. This is the right shape for ratio ∈ (1, 4/3) at large b, + // where full Newton pays an nb-sized reciprocal for a tiny quotient and + // BZ's recursion blows up 7–128× on 2^k+1-family divisor sizes. + class QuotientSizedDivision + { + public: + static constexpr SizeT GUARD = 4; + + static pair, vector> DivideAndRemainder( + vector const &a, + vector const &b, + BaseT base, + bool computeRemainder = true) + { + SizeT na = (SizeT)a.size(); + SizeT nb = (SizeT)b.size(); + // Caller contract (dispatch): a > b, and delta + GUARD < nb. + SizeT delta = na - nb; + SizeT t = delta + GUARD; + SizeT s = nb - t; + + vector a_top(a.begin() + s, a.end()); + vector b_top(b.begin() + s, b.end()); + + // Tops divide is ratio ~2. Post-#85-87 Newton beats BZ there from + // ~6k limbs (measured: 12k limbs 13 vs 22 ms, 30k 40 vs 299 ms, + // 65537 161 ms vs 5.3 s); below that the dispatcher's BZ/Fast pick + // is near-tied. + vector q = (t >= 6144) + ? NewtonDivision::DivideAndRemainder(a_top, b_top, base, false).first + : BigMath::DivideAndRemainder(a_top, b_top, base, false).first; + + // Reconstruct: r = a − q·b, fixing q's ±few-unit estimate error. + vector qb = Multiply(q, b, base); + + static const vector one{1}; + const int FIXUP_LIMIT = 8; + + int iters = 0; + while (Compare(qb, a) > 0) + { + if (++iters > FIXUP_LIMIT) + return NewtonDivision::DivideAndRemainder(a, b, base, computeRemainder); + q = Subtract(q, one, base); + qb = Subtract(qb, b, base); + } + vector r = Subtract(a, qb, base); + + iters = 0; + while (Compare(r, b) >= 0) + { + if (++iters > FIXUP_LIMIT) + return NewtonDivision::DivideAndRemainder(a, b, base, computeRemainder); + r = Subtract(r, b, base); + q = Add(q, one, base); + } + + TrimZerosToOne(q); + if (!computeRemainder) + return {q, vector()}; + TrimZerosToOne(r); + return {q, r}; + } + }; +} + +#endif diff --git a/src/algorithms/Division.cpp b/src/algorithms/Division.cpp index 7a66ff8..fe0077e 100644 --- a/src/algorithms/Division.cpp +++ b/src/algorithms/Division.cpp @@ -5,6 +5,7 @@ */ #include "biginteger/algorithms/Division.h" +#include "biginteger/algorithms/division/QuotientSizedDivision.h" #include @@ -17,6 +18,7 @@ namespace BigMath const SizeT NEWTON_BALANCED_B = BIGMATH_NEWTON_BALANCED_B; const SizeT NEWTON_BALANCED_NUMERATOR = BIGMATH_NEWTON_BALANCED_NUMERATOR; const SizeT NEWTON_BALANCED_DENOMINATOR = BIGMATH_NEWTON_BALANCED_DENOMINATOR; + const SizeT QSIZED_MIN_DELTA = BIGMATH_QSIZED_MIN_DELTA; const SizeT NEWTON_HIGH_SKEW_B = BIGMATH_NEWTON_HIGH_SKEW_B; const SizeT NEWTON_HIGH_SKEW_NUMERATOR = BIGMATH_NEWTON_HIGH_SKEW_NUMERATOR; const SizeT NEWTON_HIGH_SKEW_DENOMINATOR = BIGMATH_NEWTON_HIGH_SKEW_DENOMINATOR; @@ -60,6 +62,18 @@ namespace BigMath if (newton_eligible) return NewtonDivision::DivideAndRemainder(a, b, base, computeRemainder); + // Short-quotient band: large divisors below the Newton balanced band + // (ratio < 4/3) with a quotient big enough that FastDivision's O(n*delta) + // loses. BZ's near-balanced path blows up 7-128x on 2^k+1-family divisor + // sizes here; quotient-sized division scales with the quotient instead. + bool qsized_eligible = + (base == Base2_32 || base == Base2_64) && + b.size() >= NEWTON_BALANCED_B && + a.size() >= b.size() + QSIZED_MIN_DELTA && + NEWTON_BALANCED_DENOMINATOR * a.size() < NEWTON_BALANCED_NUMERATOR * b.size(); + if (qsized_eligible) + return QuotientSizedDivision::DivideAndRemainder(a, b, base, computeRemainder); + // BZ for large near-balanced divisors and for big-and-skewed cases. // The +32-limb quotient-bulk guard in the near-balanced clause excludes // degenerate cases where a ≈ b and the quotient is 0-2 limbs — BZ would