diff --git a/BENCHMARK.md b/BENCHMARK.md index 9c698f3..2db1f04 100644 --- a/BENCHMARK.md +++ b/BENCHMARK.md @@ -437,6 +437,7 @@ ToString (`BigDecimal → string`): - **Multiplication skewed:** parity around 500k×50k and 1M×100k, with a slight BigMath lead at 2M×200k. BigMath falls behind at 50M×5M, 100M×10M, and 200M×20M. - **Division skewed:** 1.86×-6.51× behind GMP in the main band, with the new 200M×40M row at 2.85×. Worst remains 200k×50k (BZ band). The large-multiplication speedups still flow through Newton, but the structure overhead never disappears. - **Division balanced 5M×5M:** PR #56 fix routes degenerate-quotient cases to FastDivision; 27.03× → 7.20×. +- **Division near-balanced (ratio ≈ 2, large `b`):** PR #79 added a Newton balanced band (`b ≥ 98304, a ≥ 2b`) that dodges BZ's 5–60× non-power-of-2 blowup — 8.8× at 500k/250k, 97.7× at the 2¹⁸+1 worst case. Division is now NTT-bound for ratio ≥ 2 and ratio ≥ 3. **Residual:** ratio ∈ (1, 2) at large `b` still on BZ (~2.7× slower than Newton at ratio 1.5). - **Parse:** **1.56× at 20M** (best), widens to 1.92× at 50M as GMP's SSA path dominates. - **ToString:** 2.52× at 20M, narrowing from an 8.40× peak at 100k. - **BigDecimal division:** beats GMP at small target scales (0.20-1.00×) and at the 500 dp near-parity point (0.89×). @@ -460,6 +461,7 @@ PR #65 added MFA / Bailey 6-step routing inside `NTTMultiplicationCrt`. The defa | `BZ_DIVISOR_THRESHOLD` | 512 | 768 | Keep — current 512 already correct (BZ activates at b > 512; the 1024/512 tie is FastDiv noise) | | `NEWTON_MEDIUM_B` | 4096 | 8192 | Keep — current dispatch uses NEWTON_MEDIUM_B together with `a ≥ 3b`, all measured 8192-divisor cases at ratio ≥ 4 still win Newton | | `NEWTON_SKEW_NUMERATOR/DENOMINATOR` | 3/1 | 4/1 | Keep — bench rows from 5M×1M onward confirm Newton wins at ratio 5 (BigMath path) | +| `NEWTON_BALANCED_B` / ratio | 98304 / 2/1 | — | Added PR #79 — near-balanced band; below it BZ wins, above it BZ blows up 5–60× on non-pow2 sizes | | `NEWTON_HIGH_SKEW_*` | 2048 / 8/1 | (n/a, no rows) | Keep | **Toom-3 added to dispatch (2026-05-27, skew-gated 2026-05-30):** focused band scan (above, **Toom-3 dispatch band**) found Toom-3 wins by 4-10% in total ∈ [3 584, 4 096] and avoids a 33% NTT-length boundary regression at total 4 608. Dispatch slots Toom-3 between Karatsuba and NTT for total ∈ [2 560, 5 120), but only for non-skewed operands (`max < 2*min`). A skew-focused scan found 2:1+ shapes in that same window are faster through Karatsuba, with wins such as `2560x256` improving from 0.730 ms → 0.343 ms through `Multiply()`. `NTT_MULTIPLICATION_THRESHOLD` remains 5 120. Toom-5 still has no productive band — ties Karatsuba in 64-256 per-operand limbs and degrades sharply above 256, so stays excluded. diff --git a/CLAUDE.md b/CLAUDE.md index 7a8607f..8dabf9b 100644 --- a/CLAUDE.md +++ b/CLAUDE.md @@ -40,13 +40,18 @@ CI (`.github/workflows/qa.yaml`) runs `nanoclaw-task qa-agent` on PRs against `o **Why no Schönhage-Strassen.** NTT here uses the Goldilocks prime `P = 2^64 - 2^32 + 1` with 16-bit input split. Convolution-accumulation overflow only at ~2^31 limbs (≈ 8 GB operands), so NTT covers every practical size. SSA's asymptotic edge is `log log n`; at any size below ~2^40 limbs the constant-factor cost of mod-2^N+1 arithmetic loses to NTT. The ~700+ lines of SSA aren't justified for this codebase — if you need more big-int wins, look at block-recursive Mulders division or Schönhage's subquadratic GCD instead. -**Division dispatch** (`algorithms/Division.h`): -- big-and-skewed (`a.size() > 2048 && a.size() > 3*b.size()`) → `BurnikelZieglerDivision` -- Newton band → `NewtonDivision` — Newton-Raphson reciprocal, O(M(n)). Handles arbitrary `na/nb` via internal blockwise mode (top chunk in [n+1, 2n], slide down by n, thread the remainder). Two dispatch bands: `b ≥ 24576` at ratio ≥ 4/3, or `b ≥ 16384` at ratio ≥ 2. Wins range from 4× (ratio 2) to 20× (ratio 4 at large sizes) vs FD/BZ. +**Division dispatch** (`algorithms/Division.h`, thresholds defined in `src/algorithms/Division.cpp`): +- `NewtonDivision` (Newton-Raphson reciprocal, O(M(n)); handles arbitrary `na/nb` via blockwise mode — top chunk in [n+1, 2n], slide down by n, thread the remainder) when any skew band holds: + - `b ≥ 4096` at ratio ≥ 3 (`NEWTON_SKEW` 3/1), or + - `b ≥ 98304` at ratio ≥ 2 (`NEWTON_BALANCED` 2/1 — the near-balanced band, PR #79), or + - `b ≥ 2048` at ratio ≥ 8 (`NEWTON_HIGH_SKEW` 8/1). +- else `BurnikelZieglerDivision` for power-of-two base when `b > 512` and the BZ band fits (near-balanced `b ≥ 1024, b+32 ≤ a ≤ 3b`, or big-and-skewed `a > 2048 && a > 3b`). - otherwise multi-limb → `FastDivision` (Knuth Algorithm D variant) - single-limb divisor → `ClassicDivision` - `KnuthDivision` and `ReciprocalDivision` are alternates used by correctness tests for cross-checking. +The balanced band exists because BZ's recursive 2n/n halving lands intermediate NTT multiplies just over power-of-2 transform-length boundaries for non-power-of-2 divisor sizes, blowing up 5–60× vs Newton (worst at `n = 2^k+1`); Newton pads once and stays flat. **Known residual:** ratio ∈ (1, 2) at large `b` still routes to BZ and hits the same blowup (~2.7× slower than Newton would be at ratio 1.5) — the balanced band's `a ≥ 2b` lower bound doesn't cover it yet. + When adding a new algorithm, slot the implementation under `algorithms//.h`, then update the dispatch in `algorithms/.h` — the thresholds there are the only place size cutoffs live. **Tests cross-check by construction.** Correctness tests (`mult_correctness.cpp`, `div_correctness.cpp`) run every algorithm against classic/BurnikelZiegler and compare results limb-for-limb via `Compare(...)`; the division test also verifies the identity `q*b + r == a` and `r < b`. When changing an algorithm, run these — passing tests means the new code agrees with the reference path, not just that it ran. diff --git a/README.md b/README.md index 8a1a1f8..dba8844 100644 --- a/README.md +++ b/README.md @@ -13,7 +13,7 @@ Target: pure C++ that reaches parity or better than GMP in selected multiplicati - **Radix-4 + radix-8 fused NTT butterflies** (PRs #59, #60). Adjacent radix-2 layers collapse into single load/store butterflies — radix-4 fuses 2 layers (4 elements), radix-8 fuses 3 layers (8 elements). Same modular op count; 3× fewer memory passes vs radix-2. ~1.6× wall-clock at ≥2M limbs. - **MFA / Bailey 6-step CRT NTT** (`BIGMATH_NTT_MFA_THRESHOLD=2^24`, default). Very large CRT transforms switch to a cache-friendly matrix Fourier layout. The threshold was retuned upward from `2^21` to avoid regressions in the 300k-2M limb band while keeping wins at `2^24+` transform sizes. - **Multithreaded NTT** (`BIGMATH_USE_THREADS=1`, default). Small thread pool (size `min(hw_concurrency, BIGMATH_MAX_THREADS=8)`) parallelizes the CRT path: 6 forwards + 3 inverses as batched work units, one `ParallelDo` dispatch per phase. 2.3-3.4× speedup on large mul / skewed div / parse. Opt out via `-DBIGMATH_USE_THREADS=0` to drop pthread linkage. -- **Division:** Classic short division → Knuth Algorithm D (`FastDivision` with Möller-Granlund 3-by-2 qhat for Base2_32) → Burnikel–Ziegler (balanced large) → Newton–Raphson with reciprocal caching (skewed large). Identity `q·b + r == a` is cross-checked in `tests/div_correctness.cpp`. +- **Division:** Classic short division → Knuth Algorithm D (`FastDivision` with Möller-Granlund 3-by-2 qhat for Base2_32) → Burnikel–Ziegler (mid-size near-balanced) → Newton–Raphson with reciprocal caching (skewed large, plus near-balanced ratio ≥ 2 above 98304 limbs where BZ blows up 5–60× on non-power-of-2 divisor sizes — PR #79). Identity `q·b + r == a` is cross-checked in `tests/div_correctness.cpp`. - **Squaring:** Specialized Classic / Karatsuba / NTT squarers (1.4–1.6× over `Multiply(a,a)`). - **String I/O:** Linear chunked parser/formatter for small inputs, divide-and-conquer with cached Newton reciprocals at scale. Asymptotic `O(M(L) · log L)` both directions. - **BigDecimal:** Java-style fixed-point decimal (unscaled BigInteger + int scale) with exact +, −, \*; rounded division taking 8 rounding modes; parse/format covering plain and scientific notation. diff --git a/division_improvement_plan.md b/division_improvement_plan.md index ad12d58..271da69 100644 --- a/division_improvement_plan.md +++ b/division_improvement_plan.md @@ -57,6 +57,20 @@ Finding: Newton was entering too early for 1024-2048 limb divisors. The retune u - high-skew Newton: `b >= 2048` and `a >= 8b` - BZ: enabled for both `Base2_32` and `Base2_64` +## 3b. Near-Balanced Newton Band + BZ Non-Power-of-2 Blowup (PR #79) + +Status: completed. + +Profiling near-balanced (ratio ≈ 2) division at large `b` found two compounding bugs: + +1. **Newton single-block pathology.** The single-block path (`na ≤ 2n+1`) ran a `2n+1`-limb chunk through the truncated `(chunk·R) >> 2n` estimate; the error scales with `chunk / B^(2n)` and reaches ~`B` once the chunk exceeds `2n` limbs (the `+1` comes from the Knuth normalize shift on `a ≈ 2n`), overflowing the fixup cap and bailing to quadratic FastDivision (23× spike at `nb = 50000`). Fixed by routing `na > 2n` through the blockwise path. + +2. **BZ non-power-of-2 blowup.** BZ's recursive 2n/n halving lands its intermediate NTT multiplies just over power-of-2 transform-length boundaries for non-power-of-2 divisor sizes; the FFT length doubles and the constant factor compounds across recursion depth into a **5–60× slowdown vs Newton**, worst at `n = 2^k+1` (≈90 s for a 262145-limb divisor vs Newton ~0.9 s). + +Fix: new Newton balanced band — `b >= NEWTON_BALANCED_B (98304)` and `a >= 2b` → Newton. Measured 8.8× at 500k/250k, 5.6× at 2M/1M, 97.7× at the 2¹⁸+1 worst case; exact-pow2 sizes (BZ best case) regress ~4%. Harness: `tests/performance/division_balanced_bench.cpp`. + +**Residual / next step:** ratio ∈ (1, 2) at large `b` still routes to BZ and hits the same blowup (~2.7× slower than Newton at ratio 1.5). Extending the balanced band below ratio 2 needs a quotient-bulk lower bound so genuinely tiny-quotient `a ≈ b` cases (where a full reciprocal is wasteful) stay on FastDivision. + ## 4. Prototype GMP-Style Pre-Inverted D&C Division Status: prototyped and rejected for now. diff --git a/docs/DIVISION.md b/docs/DIVISION.md index 6cf0801..f624727 100644 --- a/docs/DIVISION.md +++ b/docs/DIVISION.md @@ -62,7 +62,7 @@ flowchart TD C -- no --> D{Compare(a, b)} D -- a == b --> Eq[return {1, 0}] D -- a < b --> Less[return {0, a}] - D -- a > b --> E{Newton band?
b ≥ 4096 and a ≥ 3b
OR b ≥ 2048 and a ≥ 8b} + D -- a > b --> E{Newton band?
b ≥ 4096 and a ≥ 3b
OR b ≥ 98304 and a ≥ 2b
OR b ≥ 2048 and a ≥ 8b} E -- yes --> N[NewtonDivision] E -- no --> F{Power-of-two base
AND b.size > 512
AND BZ shape fits?} F -- yes --> BZ[BurnikelZieglerDivision] @@ -81,6 +81,9 @@ Default thresholds (overridable via `-D...`): | `BIGMATH_NEWTON_MEDIUM_B` | `4096` | limbs | Newton lower bound for medium-skew division | | `BIGMATH_NEWTON_SKEW_NUMERATOR` | `3` | — | medium-skew Newton requires `a.size() ≥ 3·b.size() / 1` | | `BIGMATH_NEWTON_SKEW_DENOMINATOR` | `1` | — | | +| `BIGMATH_NEWTON_BALANCED_B` | `98304` | limbs | Newton lower bound for the near-balanced band (PR #79) | +| `BIGMATH_NEWTON_BALANCED_NUMERATOR` | `2` | — | balanced Newton requires `a.size() ≥ 2·b.size() / 1` | +| `BIGMATH_NEWTON_BALANCED_DENOMINATOR` | `1` | — | | | `BIGMATH_NEWTON_HIGH_SKEW_B` | `2048` | limbs | lower bound for very-high-skew Newton | | `BIGMATH_NEWTON_HIGH_SKEW_NUMERATOR` | `8` | — | high-skew Newton requires `a.size() ≥ 8·b.size() / 1` | | `BIGMATH_NEWTON_HIGH_SKEW_DENOMINATOR` | `1` | — | | @@ -90,8 +93,9 @@ Default thresholds (overridable via `-D...`): The current dispatch logic, paraphrased: ``` -1. (b.size ≥ 4096 AND a.size ≥ 3·b.size) - OR (b.size ≥ 2048 AND a.size ≥ 8·b.size) → Newton +1. (b.size ≥ 4096 AND a.size ≥ 3·b.size) + OR (b.size ≥ 98304 AND a.size ≥ 2·b.size) ← near-balanced band (PR #79) + OR (b.size ≥ 2048 AND a.size ≥ 8·b.size) → Newton 2. (Base2_32 OR Base2_64) AND b.size > 512 AND ( (b.size ≥ 1024 AND a.size ≥ b.size + 32 AND a.size ≤ 3·b.size) OR (a.size > 2048 AND a.size > 3·b.size) ) → Burnikel–Ziegler @@ -99,7 +103,9 @@ The current dispatch logic, paraphrased: 4. (b.size == 1 inside FastDivision) → ClassicDivision ``` -The ordering matters: Newton wins on **large skewed** problems because the per-divisor reciprocal setup amortizes over multiple chunks. BZ wins on **near-balanced and mid-size skewed** problems where its 2n/n recursion structure beats both FastDivision and Newton's setup cost. FastDivision is the default workhorse for everything else. +The ordering matters: Newton wins on **large skewed** problems because the per-divisor reciprocal setup amortizes over multiple chunks. BZ wins on **mid-size near-balanced** problems where its 2n/n recursion structure beats both FastDivision and Newton's setup cost. + +**Near-balanced band (PR #79).** Above `NEWTON_BALANCED_B` (98304 limbs), ratio-≥2 division goes to Newton instead of BZ. BZ's recursive 2n/n halving lands its intermediate NTT multiplies just over power-of-2 transform-length boundaries for non-power-of-2 divisor sizes — the FFT length doubles and the constant factor compounds across recursion depth into a **5–60× slowdown vs Newton**, worst at `n = 2^k + 1` (measured ~90 s for a 262145-limb divisor vs Newton's ~0.9 s). Newton pads once to the working size and stays flat. Exact-power-of-2 divisor sizes are BZ's best case (it ties Newton); they regress ~4 % under this band but are rare in practice. **Known residual:** ratio ∈ (1, 2) at large `b` still routes to BZ and hits the same blowup (~2.7× slower than Newton at ratio 1.5); the band's `a ≥ 2b` lower bound does not cover it. FastDivision is the default workhorse for everything else. `KnuthDivision` and `ReciprocalDivision` exist as alternate implementations used by correctness tests for cross-checking. They are not in the production dispatch path. @@ -245,7 +251,7 @@ Once `R` is computed, a single division `a / D` becomes: The implementation calls this `DivideChunk`. -**Blockwise mode for large dividends.** When `na > 2n + 1` (where `n = |b|`), the algorithm processes the dividend in chunks: +**Blockwise mode for large dividends.** When `na > 2n` (where `n = |b|`), the algorithm processes the dividend in chunks: ``` @@ -265,9 +271,9 @@ Each chunk costs O(M(n)) and there are O(na / n) chunks, so total cost is O((na 1. The reciprocal `R` is computed once and reused across all chunks. 2. Each chunk's `DivideChunk` is O(M(n)) thanks to NTT/Karatsuba in the high-half multiplication, not O(n · chunk_size). -**High-precision reciprocal flag.** When `na ≥ 2n`, the implementation runs one **extra Newton iteration at full precision** after the main loop (`EXTRA_REFINE_ITERS = 1`). This is necessary because integer rounding in the standard iteration leaves `R` accurate to only ~half-bits at large `n`, which combined with a dividend window of size ≥ 2n produces a `Q_estimate` error of `O(n)` — too large for the `FIXUP_LIMIT = 8` correction loop to absorb. The extra refinement iteration drops the error to ≤ 1 quotient digit, which the fixup loop handles trivially. +**Single-block boundary (PR #79).** The single-block path handles `na ≤ 2n`; `na > 2n` goes blockwise. The boundary is `2n`, not `2n+1`: with a `2n+1`-limb chunk the truncated `(chunk·R) >> 2n` quotient estimate underestimates `Q` by up to ~`B` steps (the error is `∝ chunk / B^(2n)`, which reaches `B` once the chunk exceeds `2n` limbs), overflowing the fixup cap and falling back to quadratic `FastDivision`. The `+1` limb routinely appears from the Knuth normalize shift on `a ≈ 2n`. Before this fix, `nb = 50000` spiked to ~5200 ms (vs ~360 ms at neighbouring sizes); routing `na > 2n` through the blockwise path keeps every chunk ≤ 2n and the fixup loop at 0–2 iterations. -Without this flag, the fixup loop diverged at `n = 32768` in early testing, leading to a fallback path that hands off to `FastDivision` if iterations exhaust. +**High-precision reciprocal flag.** When `na ≥ 2n`, the implementation also runs one **extra Newton iteration at full precision** after the main loop (`EXTRA_REFINE_ITERS = 1`), to drop integer-rounding error in `R` to ≤ 1 quotient digit. Without it the fixup loop diverged at `n = 32768` in early testing, handing off to `FastDivision`. **Why Newton, why bands.** The structural win of Newton over FastDivision is the O(M(n)) per-chunk cost versus FastDivision's O((m−n+1)·n) quadratic-in-quotient-size cost. The win materializes once the divisor is large enough that NTT/Karatsuba is faster than scalar multi-precision arithmetic, and once the ratio is skewed enough that reciprocal setup amortizes. The 2026-05 tuning lowered the dispatcher band from `b.size ≥ 8192` to `b.size ≥ 1024` with skew threshold `a ≥ 3b`, after a GMP-bench regression at `(a=200k, b=50k digits)` showed Newton was missing the band by 1–3 limbs at boundary cases.