Skip to content
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
2 changes: 2 additions & 0 deletions BENCHMARK.md
Original file line number Diff line number Diff line change
Expand Up @@ -437,6 +437,7 @@ ToString (`BigDecimal → string`):
- **Multiplication skewed:** parity around 500k×50k and 1M×100k, with a slight BigMath lead at 2M×200k. BigMath falls behind at 50M×5M, 100M×10M, and 200M×20M.
- **Division skewed:** 1.86×-6.51× behind GMP in the main band, with the new 200M×40M row at 2.85×. Worst remains 200k×50k (BZ band). The large-multiplication speedups still flow through Newton, but the structure overhead never disappears.
- **Division balanced 5M×5M:** PR #56 fix routes degenerate-quotient cases to FastDivision; 27.03× → 7.20×.
- **Division near-balanced (ratio ≈ 2, large `b`):** PR #79 added a Newton balanced band (`b ≥ 98304, a ≥ 2b`) that dodges BZ's 5–60× non-power-of-2 blowup — 8.8× at 500k/250k, 97.7× at the 2¹⁸+1 worst case. Division is now NTT-bound for ratio ≥ 2 and ratio ≥ 3. **Residual:** ratio ∈ (1, 2) at large `b` still on BZ (~2.7× slower than Newton at ratio 1.5).
- **Parse:** **1.56× at 20M** (best), widens to 1.92× at 50M as GMP's SSA path dominates.
- **ToString:** 2.52× at 20M, narrowing from an 8.40× peak at 100k.
- **BigDecimal division:** beats GMP at small target scales (0.20-1.00×) and at the 500 dp near-parity point (0.89×).
Expand All @@ -460,6 +461,7 @@ PR #65 added MFA / Bailey 6-step routing inside `NTTMultiplicationCrt`. The defa
| `BZ_DIVISOR_THRESHOLD` | 512 | 768 | Keep — current 512 already correct (BZ activates at b > 512; the 1024/512 tie is FastDiv noise) |
| `NEWTON_MEDIUM_B` | 4096 | 8192 | Keep — current dispatch uses NEWTON_MEDIUM_B together with `a ≥ 3b`, all measured 8192-divisor cases at ratio ≥ 4 still win Newton |
| `NEWTON_SKEW_NUMERATOR/DENOMINATOR` | 3/1 | 4/1 | Keep — bench rows from 5M×1M onward confirm Newton wins at ratio 5 (BigMath path) |
| `NEWTON_BALANCED_B` / ratio | 98304 / 2/1 | — | Added PR #79 — near-balanced band; below it BZ wins, above it BZ blows up 5–60× on non-pow2 sizes |
| `NEWTON_HIGH_SKEW_*` | 2048 / 8/1 | (n/a, no rows) | Keep |

**Toom-3 added to dispatch (2026-05-27, skew-gated 2026-05-30):** focused band scan (above, **Toom-3 dispatch band**) found Toom-3 wins by 4-10% in total ∈ [3 584, 4 096] and avoids a 33% NTT-length boundary regression at total 4 608. Dispatch slots Toom-3 between Karatsuba and NTT for total ∈ [2 560, 5 120), but only for non-skewed operands (`max < 2*min`). A skew-focused scan found 2:1+ shapes in that same window are faster through Karatsuba, with wins such as `2560x256` improving from 0.730 ms → 0.343 ms through `Multiply()`. `NTT_MULTIPLICATION_THRESHOLD` remains 5 120. Toom-5 still has no productive band — ties Karatsuba in 64-256 per-operand limbs and degrades sharply above 256, so stays excluded.
11 changes: 8 additions & 3 deletions CLAUDE.md
Original file line number Diff line number Diff line change
Expand Up @@ -40,13 +40,18 @@ CI (`.github/workflows/qa.yaml`) runs `nanoclaw-task qa-agent` on PRs against `o

**Why no Schönhage-Strassen.** NTT here uses the Goldilocks prime `P = 2^64 - 2^32 + 1` with 16-bit input split. Convolution-accumulation overflow only at ~2^31 limbs (≈ 8 GB operands), so NTT covers every practical size. SSA's asymptotic edge is `log log n`; at any size below ~2^40 limbs the constant-factor cost of mod-2^N+1 arithmetic loses to NTT. The ~700+ lines of SSA aren't justified for this codebase — if you need more big-int wins, look at block-recursive Mulders division or Schönhage's subquadratic GCD instead.

**Division dispatch** (`algorithms/Division.h`):
- big-and-skewed (`a.size() > 2048 && a.size() > 3*b.size()`) → `BurnikelZieglerDivision`
- Newton band → `NewtonDivision` — Newton-Raphson reciprocal, O(M(n)). Handles arbitrary `na/nb` via internal blockwise mode (top chunk in [n+1, 2n], slide down by n, thread the remainder). Two dispatch bands: `b ≥ 24576` at ratio ≥ 4/3, or `b ≥ 16384` at ratio ≥ 2. Wins range from 4× (ratio 2) to 20× (ratio 4 at large sizes) vs FD/BZ.
**Division dispatch** (`algorithms/Division.h`, thresholds defined in `src/algorithms/Division.cpp`):
- `NewtonDivision` (Newton-Raphson reciprocal, O(M(n)); handles arbitrary `na/nb` via blockwise mode — top chunk in [n+1, 2n], slide down by n, thread the remainder) when any skew band holds:
- `b ≥ 4096` at ratio ≥ 3 (`NEWTON_SKEW` 3/1), or
- `b ≥ 98304` at ratio ≥ 2 (`NEWTON_BALANCED` 2/1 — the near-balanced band, PR #79), or
- `b ≥ 2048` at ratio ≥ 8 (`NEWTON_HIGH_SKEW` 8/1).
- else `BurnikelZieglerDivision` for power-of-two base when `b > 512` and the BZ band fits (near-balanced `b ≥ 1024, b+32 ≤ a ≤ 3b`, or big-and-skewed `a > 2048 && a > 3b`).
- otherwise multi-limb → `FastDivision` (Knuth Algorithm D variant)
- single-limb divisor → `ClassicDivision`
- `KnuthDivision` and `ReciprocalDivision` are alternates used by correctness tests for cross-checking.

The balanced band exists because BZ's recursive 2n/n halving lands intermediate NTT multiplies just over power-of-2 transform-length boundaries for non-power-of-2 divisor sizes, blowing up 5–60× vs Newton (worst at `n = 2^k+1`); Newton pads once and stays flat. **Known residual:** ratio ∈ (1, 2) at large `b` still routes to BZ and hits the same blowup (~2.7× slower than Newton would be at ratio 1.5) — the balanced band's `a ≥ 2b` lower bound doesn't cover it yet.

When adding a new algorithm, slot the implementation under `algorithms/<op>/<Name>.h`, then update the dispatch in `algorithms/<Op>.h` — the thresholds there are the only place size cutoffs live.

**Tests cross-check by construction.** Correctness tests (`mult_correctness.cpp`, `div_correctness.cpp`) run every algorithm against classic/BurnikelZiegler and compare results limb-for-limb via `Compare(...)`; the division test also verifies the identity `q*b + r == a` and `r < b`. When changing an algorithm, run these — passing tests means the new code agrees with the reference path, not just that it ran.
Expand Down
2 changes: 1 addition & 1 deletion README.md
Original file line number Diff line number Diff line change
Expand Up @@ -13,7 +13,7 @@ Target: pure C++ that reaches parity or better than GMP in selected multiplicati
- **Radix-4 + radix-8 fused NTT butterflies** (PRs #59, #60). Adjacent radix-2 layers collapse into single load/store butterflies — radix-4 fuses 2 layers (4 elements), radix-8 fuses 3 layers (8 elements). Same modular op count; 3× fewer memory passes vs radix-2. ~1.6× wall-clock at ≥2M limbs.
- **MFA / Bailey 6-step CRT NTT** (`BIGMATH_NTT_MFA_THRESHOLD=2^24`, default). Very large CRT transforms switch to a cache-friendly matrix Fourier layout. The threshold was retuned upward from `2^21` to avoid regressions in the 300k-2M limb band while keeping wins at `2^24+` transform sizes.
- **Multithreaded NTT** (`BIGMATH_USE_THREADS=1`, default). Small thread pool (size `min(hw_concurrency, BIGMATH_MAX_THREADS=8)`) parallelizes the CRT path: 6 forwards + 3 inverses as batched work units, one `ParallelDo` dispatch per phase. 2.3-3.4× speedup on large mul / skewed div / parse. Opt out via `-DBIGMATH_USE_THREADS=0` to drop pthread linkage.
- **Division:** Classic short division → Knuth Algorithm D (`FastDivision` with Möller-Granlund 3-by-2 qhat for Base2_32) → Burnikel–Ziegler (balanced large) → Newton–Raphson with reciprocal caching (skewed large). Identity `q·b + r == a` is cross-checked in `tests/div_correctness.cpp`.
- **Division:** Classic short division → Knuth Algorithm D (`FastDivision` with Möller-Granlund 3-by-2 qhat for Base2_32) → Burnikel–Ziegler (mid-size near-balanced) → Newton–Raphson with reciprocal caching (skewed large, plus near-balanced ratio ≥ 2 above 98304 limbs where BZ blows up 5–60× on non-power-of-2 divisor sizes — PR #79). Identity `q·b + r == a` is cross-checked in `tests/div_correctness.cpp`.
- **Squaring:** Specialized Classic / Karatsuba / NTT squarers (1.4–1.6× over `Multiply(a,a)`).
- **String I/O:** Linear chunked parser/formatter for small inputs, divide-and-conquer with cached Newton reciprocals at scale. Asymptotic `O(M(L) · log L)` both directions.
- **BigDecimal:** Java-style fixed-point decimal (unscaled BigInteger + int scale) with exact +, −, \*; rounded division taking 8 rounding modes; parse/format covering plain and scientific notation.
Expand Down
14 changes: 14 additions & 0 deletions division_improvement_plan.md
Original file line number Diff line number Diff line change
Expand Up @@ -57,6 +57,20 @@ Finding: Newton was entering too early for 1024-2048 limb divisors. The retune u
- high-skew Newton: `b >= 2048` and `a >= 8b`
- BZ: enabled for both `Base2_32` and `Base2_64`

## 3b. Near-Balanced Newton Band + BZ Non-Power-of-2 Blowup (PR #79)

Status: completed.

Profiling near-balanced (ratio ≈ 2) division at large `b` found two compounding bugs:

1. **Newton single-block pathology.** The single-block path (`na ≤ 2n+1`) ran a `2n+1`-limb chunk through the truncated `(chunk·R) >> 2n` estimate; the error scales with `chunk / B^(2n)` and reaches ~`B` once the chunk exceeds `2n` limbs (the `+1` comes from the Knuth normalize shift on `a ≈ 2n`), overflowing the fixup cap and bailing to quadratic FastDivision (23× spike at `nb = 50000`). Fixed by routing `na > 2n` through the blockwise path.

2. **BZ non-power-of-2 blowup.** BZ's recursive 2n/n halving lands its intermediate NTT multiplies just over power-of-2 transform-length boundaries for non-power-of-2 divisor sizes; the FFT length doubles and the constant factor compounds across recursion depth into a **5–60× slowdown vs Newton**, worst at `n = 2^k+1` (≈90 s for a 262145-limb divisor vs Newton ~0.9 s).

Fix: new Newton balanced band — `b >= NEWTON_BALANCED_B (98304)` and `a >= 2b` → Newton. Measured 8.8× at 500k/250k, 5.6× at 2M/1M, 97.7× at the 2¹⁸+1 worst case; exact-pow2 sizes (BZ best case) regress ~4%. Harness: `tests/performance/division_balanced_bench.cpp`.

**Residual / next step:** ratio ∈ (1, 2) at large `b` still routes to BZ and hits the same blowup (~2.7× slower than Newton at ratio 1.5). Extending the balanced band below ratio 2 needs a quotient-bulk lower bound so genuinely tiny-quotient `a ≈ b` cases (where a full reciprocal is wasteful) stay on FastDivision.

## 4. Prototype GMP-Style Pre-Inverted D&C Division

Status: prototyped and rejected for now.
Expand Down
20 changes: 13 additions & 7 deletions docs/DIVISION.md
Original file line number Diff line number Diff line change
Expand Up @@ -62,7 +62,7 @@ flowchart TD
C -- no --> D{Compare&#40;a, b&#41;}
D -- a == b --> Eq[return &#123;1, 0&#125;]
D -- a &lt; b --> Less[return &#123;0, a&#125;]
D -- a &gt; b --> E{Newton band?<br/>b ≥ 4096 and a ≥ 3b<br/>OR b ≥ 2048 and a ≥ 8b}
D -- a &gt; b --> E{Newton band?<br/>b ≥ 4096 and a ≥ 3b<br/>OR b ≥ 98304 and a ≥ 2b<br/>OR b ≥ 2048 and a ≥ 8b}
E -- yes --> N[NewtonDivision]
E -- no --> F{Power-of-two base<br/>AND b.size &gt; 512<br/>AND BZ shape fits?}
F -- yes --> BZ[BurnikelZieglerDivision]
Expand All @@ -81,6 +81,9 @@ Default thresholds (overridable via `-D...`):
| `BIGMATH_NEWTON_MEDIUM_B` | `4096` | limbs | Newton lower bound for medium-skew division |
| `BIGMATH_NEWTON_SKEW_NUMERATOR` | `3` | — | medium-skew Newton requires `a.size() ≥ 3·b.size() / 1` |
| `BIGMATH_NEWTON_SKEW_DENOMINATOR` | `1` | — | |
| `BIGMATH_NEWTON_BALANCED_B` | `98304` | limbs | Newton lower bound for the near-balanced band (PR #79) |
| `BIGMATH_NEWTON_BALANCED_NUMERATOR` | `2` | — | balanced Newton requires `a.size() ≥ 2·b.size() / 1` |
| `BIGMATH_NEWTON_BALANCED_DENOMINATOR` | `1` | — | |
| `BIGMATH_NEWTON_HIGH_SKEW_B` | `2048` | limbs | lower bound for very-high-skew Newton |
| `BIGMATH_NEWTON_HIGH_SKEW_NUMERATOR` | `8` | — | high-skew Newton requires `a.size() ≥ 8·b.size() / 1` |
| `BIGMATH_NEWTON_HIGH_SKEW_DENOMINATOR` | `1` | — | |
Expand All @@ -90,16 +93,19 @@ Default thresholds (overridable via `-D...`):
The current dispatch logic, paraphrased:

```
1. (b.size ≥ 4096 AND a.size ≥ 3·b.size)
OR (b.size ≥ 2048 AND a.size ≥ 8·b.size) → Newton
1. (b.size ≥ 4096 AND a.size ≥ 3·b.size)
OR (b.size ≥ 98304 AND a.size ≥ 2·b.size) ← near-balanced band (PR #79)
OR (b.size ≥ 2048 AND a.size ≥ 8·b.size) → Newton
2. (Base2_32 OR Base2_64) AND b.size > 512 AND
( (b.size ≥ 1024 AND a.size ≥ b.size + 32 AND a.size ≤ 3·b.size)
OR (a.size > 2048 AND a.size > 3·b.size) ) → Burnikel–Ziegler
3. else → FastDivision
4. (b.size == 1 inside FastDivision) → ClassicDivision
```

The ordering matters: Newton wins on **large skewed** problems because the per-divisor reciprocal setup amortizes over multiple chunks. BZ wins on **near-balanced and mid-size skewed** problems where its 2n/n recursion structure beats both FastDivision and Newton's setup cost. FastDivision is the default workhorse for everything else.
The ordering matters: Newton wins on **large skewed** problems because the per-divisor reciprocal setup amortizes over multiple chunks. BZ wins on **mid-size near-balanced** problems where its 2n/n recursion structure beats both FastDivision and Newton's setup cost.

**Near-balanced band (PR #79).** Above `NEWTON_BALANCED_B` (98304 limbs), ratio-≥2 division goes to Newton instead of BZ. BZ's recursive 2n/n halving lands its intermediate NTT multiplies just over power-of-2 transform-length boundaries for non-power-of-2 divisor sizes — the FFT length doubles and the constant factor compounds across recursion depth into a **5–60× slowdown vs Newton**, worst at `n = 2^k + 1` (measured ~90 s for a 262145-limb divisor vs Newton's ~0.9 s). Newton pads once to the working size and stays flat. Exact-power-of-2 divisor sizes are BZ's best case (it ties Newton); they regress ~4 % under this band but are rare in practice. **Known residual:** ratio ∈ (1, 2) at large `b` still routes to BZ and hits the same blowup (~2.7× slower than Newton at ratio 1.5); the band's `a ≥ 2b` lower bound does not cover it. FastDivision is the default workhorse for everything else.

`KnuthDivision` and `ReciprocalDivision` exist as alternate implementations used by correctness tests for cross-checking. They are not in the production dispatch path.

Expand Down Expand Up @@ -245,7 +251,7 @@ Once `R` is computed, a single division `a / D` becomes:

The implementation calls this `DivideChunk`.

**Blockwise mode for large dividends.** When `na > 2n + 1` (where `n = |b|`), the algorithm processes the dividend in chunks:
**Blockwise mode for large dividends.** When `na > 2n` (where `n = |b|`), the algorithm processes the dividend in chunks:

```

Expand All @@ -265,9 +271,9 @@ Each chunk costs O(M(n)) and there are O(na / n) chunks, so total cost is O((na
1. The reciprocal `R` is computed once and reused across all chunks.
2. Each chunk's `DivideChunk` is O(M(n)) thanks to NTT/Karatsuba in the high-half multiplication, not O(n · chunk_size).

**High-precision reciprocal flag.** When `na 2n`, the implementation runs one **extra Newton iteration at full precision** after the main loop (`EXTRA_REFINE_ITERS = 1`). This is necessary because integer rounding in the standard iteration leaves `R` accurate to only ~half-bits at large `n`, which combined with a dividend window of size ≥ 2n produces a `Q_estimate` error of `O(n)` — too large for the `FIXUP_LIMIT = 8` correction loop to absorb. The extra refinement iteration drops the error to ≤ 1 quotient digit, which the fixup loop handles trivially.
**Single-block boundary (PR #79).** The single-block path handles `na 2n`; `na > 2n` goes blockwise. The boundary is `2n`, not `2n+1`: with a `2n+1`-limb chunk the truncated `(chunk·R) >> 2n` quotient estimate underestimates `Q` by up to ~`B` steps (the error is `∝ chunk / B^(2n)`, which reaches `B` once the chunk exceeds `2n` limbs), overflowing the fixup cap and falling back to quadratic `FastDivision`. The `+1` limb routinely appears from the Knuth normalize shift on `a ≈ 2n`. Before this fix, `nb = 50000` spiked to ~5200 ms (vs ~360 ms at neighbouring sizes); routing `na > 2n` through the blockwise path keeps every chunk ≤ 2n and the fixup loop at 0–2 iterations.

Without this flag, the fixup loop diverged at `n = 32768` in early testing, leading to a fallback path that hands off to `FastDivision` if iterations exhaust.
**High-precision reciprocal flag.** When `na ≥ 2n`, the implementation also runs one **extra Newton iteration at full precision** after the main loop (`EXTRA_REFINE_ITERS = 1`), to drop integer-rounding error in `R` to ≤ 1 quotient digit. Without it the fixup loop diverged at `n = 32768` in early testing, handing off to `FastDivision`.

**Why Newton, why bands.** The structural win of Newton over FastDivision is the O(M(n)) per-chunk cost versus FastDivision's O((m−n+1)·n) quadratic-in-quotient-size cost. The win materializes once the divisor is large enough that NTT/Karatsuba is faster than scalar multi-precision arithmetic, and once the ratio is skewed enough that reciprocal setup amortizes. The 2026-05 tuning lowered the dispatcher band from `b.size ≥ 8192` to `b.size ≥ 1024` with skew threshold `a ≥ 3b`, after a GMP-bench regression at `(a=200k, b=50k digits)` showed Newton was missing the band by 1–3 limbs at boundary cases.

Expand Down
Loading