136 changes: 84 additions & 52 deletions BENCHMARK.md
@@ -2,6 +2,8 @@

Measured 2026-05-30 (fresh local rerun). Single-run snapshot capturing the state of `master` after the 2026-05 optimization pass (LIMB_64 default, multi-prime CRT NTT default, multithreaded NTT default, M-G div2by1 in `ClassicDivision`, M-G 3/2 qhat in `FastDivision` Base2_64, BZ Knuth-normalize fix, radix-4 + radix-8 fused NTT butterflies (PRs #59, #60), **Matrix Fourier Algorithm / Bailey 6-step CRT NTT retuned to n ≥ 2^24 coefficients**).

**Division, parse, and ToString tables re-measured 2026-06-11** after the division/dispatch session (PRs #82–#89: FastDivision bit-shift normalization, decimal D&C threshold retune, the wraparound-Newton family — cyclic QB remainder, top-limbs CR, invertappr reciprocal — plus the 4/3 balanced band and quotient-sized division). Multiplication tables are unchanged 2026-05-30 values (multiplication code untouched by that session). See the [2026-06-11 session section](#2026-06-11-division--decimal-io-session-prs-82-89) for the dispatch-shape wins the standard grids don't cover.

For per-subsystem deep dives — algorithms, dispatch, optimization history, rejected approaches — see:

- [docs/MULTIPLICATION.md](docs/MULTIPLICATION.md)
@@ -217,38 +219,37 @@ Balanced (`a.size() == b.size()`) — quotient is 1-2 limbs, both libraries shor

| size | BigMath ms | GMP ms | BM/GMP |
|---|---:|---:|---:|
| 1 000 × 1 000 | <0.001 | <0.001 | 10.90× |
| 5 000 × 5 000 | 0.001 | <0.001 | 6.41× |
| 10 000 × 10 000 | 0.002 | <0.001 | 6.55× |
| 50 000 × 50 000 | <0.001 | <0.001 | 0.43× |
| 100 000 × 100 000 | <0.001 | 0.001 | 0.23× |
| 500 000 × 500 000 | 0.001 | 0.005 | 0.18× |
| 1 000 000 × 1 000 000 | 0.001 | 0.010 | 0.10× |
| 5 000 000 × 5 000 000 | 1.367 | 0.189 | 7.23× |
| 1 000 × 1 000 | <0.001 | <0.001 | |
| 5 000 × 5 000 | 0.001 | <0.001 | |
| 10 000 × 10 000 | <0.001 | <0.001 | |
| 50 000 × 50 000 | 0.001 | <0.001 | 2.33× |
| 100 000 × 100 000 | 0.001 | 0.001 | 2.31× |
| 500 000 × 500 000 | 0.007 | 0.004 | 1.59× |
| 1 000 000 × 1 000 000 | 0.014 | 0.009 | 1.62× |
| 5 000 000 × 5 000 000 | 1.355 | 0.162 | 8.39× |

Skewed (`a.size() >> b.size()`) — Newton/BZ band, real algorithmic work:

| size | BigMath ms | GMP ms | BM/GMP |
|---|---:|---:|---:|
| 40 000 × 10 000 | 0.732 | 0.218 | 3.36× |
| 100 000 × 10 000 | 2.178 | 0.450 | 4.84× |
| 200 000 × 50 000 | 11.088 | 1.704 | 6.51× |
| 500 000 × 100 000 | 17.742 | 4.608 | 3.85× |
| 1 000 000 × 200 000 | 33.632 | 10.022 | 3.36× |
| 2 000 000 × 500 000 | 82.969 | 24.724 | 3.36× |
| 5 000 000 × 1 000 000 | 200.641 | 69.900 | 2.87× |
| 10 000 000 × 2 000 000 | 432.299 | 154.709 | 2.79× |
| 20 000 000 × 4 000 000 | 982.906 | 353.134 | 2.78× |
| 50 000 000 × 10 000 000 | 2 464.018 | 1 324.260 | 1.86× |
| 100 000 000 × 20 000 000 | 6 119.817 | 2 594.778 | 2.36× |
| **200 000 000 × 40 000 000** | **12 979.547** | **4 550.205** | **2.85×** ← MFA flowing through Newton |

**Observations:**

- **Skewed division ratio narrows from ~6.5× at 200k×50k peak to 1.86× at 50M×10M, then rises back to 2.85× at 200M×40M.** Newton inherits the multiplication wins in the large regime, including MFA once the internal products cross the useful MFA band. The residual gap is still division-structure overhead, not decimal I/O.
| 40 000 × 10 000 | 0.727 | 0.215 | 3.38× |
| 100 000 × 10 000 | 2.184 | 0.460 | 4.75× |
| 200 000 × 50 000 | 10.867 | 1.635 | 6.65× |
| **500 000 × 100 000** | **12.442** | **4.469** | **2.78×** ← was 3.85× |
| **1 000 000 × 200 000** | **22.599** | **9.723** | **2.32×** ← was 3.36× |
| **2 000 000 × 500 000** | **45.743** | **24.010** | **1.91×** ← was 3.36× |
| **5 000 000 × 1 000 000** | **111.407** | **70.057** | **1.59×** ← was 2.87× |
| **10 000 000 × 2 000 000** | **232.324** | **150.828** | **1.54×** ← was 2.79× |
| **20 000 000 × 4 000 000** | **519.977** | **344.471** | **1.51×** ← was 2.78× |
| **50 000 000 × 10 000 000** | **1 461.269** | **1 297.461** | **1.13×** ← was 1.86× |
| 100 000 000 × 20 000 000 | 6 119.817 | 2 594.778 | 2.36× (2026-05-30, not re-measured) |
| 200 000 000 × 40 000 000 | 12 979.547 | 4 550.205 | 2.85× (2026-05-30, not re-measured) |

**Observations (updated 2026-06-11):**

- **The wraparound-Newton family (PRs #85–#87) cut the skewed band roughly in half: 500k×100k through 50M×10M now sit at 1.13–2.78× vs GMP, from 1.86–3.85× before.** 5M×1M went 200.6 → 111.4 ms; 50M×10M is at **1.13×, near parity**. The cuts shorten the serial transform chain (cyclic mod-B^L−1 products at half length, top-limbs quotient estimates), which is the lever that converts to wall-clock on the threaded stack.
- 200k×50k stays the worst point: the divisor (2596 limbs) sits below the Newton band and goes through BZ, which is ~6.5× slower than GMP's `mpn_dcpi1_div_q` at this size.
- Residual 2.8-3.9× gap in the 1M-20M skewed band is the structural cost of Newton's chunked iteration vs GMP's single recursive divide with precomputed inverse.
- Balanced cases route through FastDivision short-circuits and aren't algorithmically meaningful at this size profile. The 5M×5M balanced case was regressing 27.03× before PR #56 fix (BZ misroute on degenerate quotient); now 7.20× via FastDivision short-circuit.
- Balanced equal-size rows are degenerate (quotient 0–1 limbs, both libraries short-circuit; sub-15µs absolute through 1M digits). FastDivision's scalar-normalization fix (PR #82, bit-shift normalize, 3.5× on that path) shows up in driver benchmarks rather than these noise-level rows.
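
The BM/GMP column and the "was" annotations are plain arithmetic on the two timing columns. A minimal sketch for reproducing them from the rows above; the helper names are illustrative and not part of BigMath, and rounding to two decimals is assumed only because that is the precision the tables show:

```cpp
#include <cassert>
#include <cmath>

// BM/GMP column: BigMath time divided by GMP time (above 1 means GMP is faster).
double RatioVsGmp(double bigmath_ms, double gmp_ms) {
    assert(gmp_ms > 0.0);
    return bigmath_ms / gmp_ms;
}

// Percent reduction in wall-clock time going from before_ms to after_ms.
double PercentFaster(double before_ms, double after_ms) {
    assert(before_ms > 0.0);
    return 100.0 * (1.0 - after_ms / before_ms);
}

// Round to two decimals, the precision used in the ratio column.
double Round2(double x) { return std::round(x * 100.0) / 100.0; }
```

For the 5M×1M row, `Round2(RatioVsGmp(111.407, 70.057))` gives 1.59 and `PercentFaster(200.641, 111.407)` is about 44.5, which is the "roughly in half" quoted in the first observation.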

### Shape-focused division dispatch

@@ -311,6 +312,37 @@ limb-for-limb across `na ∈ {2n-1, 2n, 2n+1, 2n+2, 3n}` × `nb` near the bounda

---

## 2026-06-11 division & decimal-IO session (PRs #82-#89)

One-day profiling-driven pass over division and decimal I/O. Full analysis and rejection notes in
[docs/DIVISION.md](docs/DIVISION.md) and [docs/STRING_CONVERSION.md](docs/STRING_CONVERSION.md);
this is the headline summary. The refreshed division/parse/ToString tables above are the post-session state.

| PR | change | headline win |
|---|---|---|
| #82 | FastDivision bit-shift normalization (replaces scalar-d normalize; kills per-limb `__udivmodti4` remainder denorm) | balanced div 1M digits 0.89 → 0.25 ms (3.5×) |
| #83 | decimal D&C thresholds: parse 8192 → 2048, tostr 2048 → 1024 (May sweep inverted by the radix-8/MFA/Newton changes) | parse −12…−53% (4k–1M digits), tostr −19…−33% (1.5k–10k) |
| #84 | doc-only: prepared-transform reuse in `DivideChunk` REJECTED — cut CPU 11% but wall-clock flat (the 6 forward transforms already run concurrently; only serial-chain cuts convert to wall-clock on the threaded stack) | — |
| #85 | cyclic wrap-around remainder: `Q·b_norm` replaced by its residue mod `B^L−1` at half transform length | skewed div −10–11%, tostr −7% |
| #86 | top-limbs quotient estimate (GMP `mu_divappr` style): multiply only top n+1 chunk limbs against R | skewed div −20% at transform-boundary sizes, tostr −11% |
| #87 | invertappr-style wrapped Newton iteration in `ApproxReciprocal` (exact E from cyclic residue + top-slice correction) | one-shot skewed div −24–25%, `Divider` setup −29% |
| #88 | Newton balanced band ratio 2/1 → 4/3 (post-#85–87 crossover re-measured) | nb=131073 limbs ratio 1.5: 10.7 s → 157 ms |
| #89 | **quotient-sized division** (`QuotientSizedDivision.h`): divides operand tops for short quotients; cost scales with quotient, not divisor; balanced-band floor 98304 → 24576 limbs | 2^k+1 pathology ratios 1.05–1.25: 1.07–5.35 s → 28–71 ms (38–75×) |
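
The residue mod `B^L−1` in the #85 and #87 rows rests on the identity that `B^L` is congruent to 1 modulo `B^L−1`, so a long product can be folded into `L` limbs by adding its `L`-limb slices. The sketch below shows only that folding step on 32-bit limbs; it is not the BigMath implementation, and the function name and limb width are assumptions chosen for illustration:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Reduce x (little-endian 32-bit limbs, B = 2^32) modulo B^L - 1.
// Because B^L is congruent to 1, every L-limb slice of x has weight 1 and is
// added into the low L limbs, with the carry-out wrapped back to limb 0.
std::vector<uint32_t> FoldModBLMinus1(const std::vector<uint32_t>& x, std::size_t L) {
    assert(L > 0);
    std::vector<uint32_t> r(L, 0);
    for (std::size_t base = 0; base < x.size(); base += L) {
        uint64_t carry = 0;
        for (std::size_t i = 0; i < L; ++i) {
            uint64_t limb = (base + i < x.size()) ? x[base + i] : 0;
            uint64_t s = static_cast<uint64_t>(r[i]) + limb + carry;
            r[i] = static_cast<uint32_t>(s);
            carry = s >> 32;
        }
        // End-around carry: a carry out of limb L-1 is worth B^L, i.e. 1.
        for (std::size_t i = 0; i < L && carry != 0; ++i) {
            uint64_t s = static_cast<uint64_t>(r[i]) + carry;
            r[i] = static_cast<uint32_t>(s);
            carry = s >> 32;
        }
        assert(carry == 0);  // one wrap pass is always enough
    }
    // B^L - 1 itself is congruent to 0; return the canonical zero.
    bool all_ones = true;
    for (uint32_t limb : r) all_ones = all_ones && (limb == 0xFFFFFFFFu);
    if (all_ones) {
        for (uint32_t& limb : r) limb = 0;
    }
    return r;
}
```

A length-`L` cyclic convolution yields this folded value directly, without forming the full product first. Recovering an exact remainder from the residue additionally depends on size bounds that this summary table does not spell out.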

Cumulative on the standard grid: skewed div 5M×1M 200.6 → 111.4 ms (2.87× → 1.59× vs GMP),
50M×10M at **1.13× — near parity**; tostr 1M 224.8 → 170.2 ms; parse 1M 49.0 → 43.3 ms.

Division dispatch now has no known residual shape holes: ratio ∈ (1, 4/3) at b ≥ 24576 limbs routes to
quotient-sized division (which also beats BZ at BZ's exact-power-of-2 best case), ratios ≥ 4/3 route
to Newton, and small or degenerate shapes stay on FastDivision.
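
The routing rule in the paragraph above can be written as a small decision function. This is a deliberately simplified sketch, not the real dispatcher: the enum, the function name, and the constant name are hypothetical, the 24576-limb floor is applied to the Newton branch as well (an assumption, since the text only states it for the quotient-sized band), and the BZ band mentioned elsewhere in this document is folded into the "small or degenerate" case:

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical route labels. SmallOrDegenerate stands for FastDivision and
// any other path (e.g. BZ) that the summary does not break out.
enum class DivRoute { SmallOrDegenerate, QuotientSized, Newton };

// Balanced-band floor in limbs, taken from the PR #89 row of the table.
constexpr std::size_t kBalancedFloorLimbs = 24576;

// na = dividend limbs, nb = divisor limbs.
DivRoute ChooseRoute(std::size_t na, std::size_t nb) {
    assert(nb > 0);
    if (na <= nb || nb < kBalancedFloorLimbs) {
        return DivRoute::SmallOrDegenerate;
    }
    // ratio < 4/3  <=>  3*na < 4*nb; the integer form avoids floating point.
    if (3 * na < 4 * nb) {
        return DivRoute::QuotientSized;
    }
    return DivRoute::Newton;  // ratio >= 4/3 (floor assumed, see lead-in)
}
```

Under these assumptions, `ChooseRoute(30000, 24576)` selects the quotient-sized path, while a ratio of exactly 4/3 such as `ChooseRoute(32768, 24576)` falls on the Newton side of the boundary.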

Two recorded process lessons: profile-sample arithmetic overcounts parallel-overlapped work (CPU-time
savings ≠ wall-clock savings — see PR #84's rejection), and the exact Newton tower drifts up to ~B^6 ulps,
so any residue-window sizing needs ~B^16 headroom (a B^4 window caused an 8× ToString regression via
silent FastDivision fallbacks before being caught with a `10^500000`-divisor repro).
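
The first lesson can be made concrete with a toy cost model: CPU time sums every task, while wall-clock time only sums the slowest task of each serial stage. The model below assumes enough cores to run each stage fully in parallel, and the names and timings are hypothetical rather than measurements from PR #84:

```cpp
#include <algorithm>
#include <numeric>
#include <vector>

// One stage = tasks that run concurrently; stages run one after another.
using Stage = std::vector<double>;  // per-task time in ms

// Total CPU time: every task counts, regardless of overlap.
double CpuTimeMs(const std::vector<Stage>& stages) {
    double total = 0.0;
    for (const Stage& s : stages) {
        total = std::accumulate(s.begin(), s.end(), total);
    }
    return total;
}

// Wall-clock time: each stage costs only as much as its slowest task.
double WallClockMs(const std::vector<Stage>& stages) {
    double total = 0.0;
    for (const Stage& s : stages) {
        if (!s.empty()) total += *std::max_element(s.begin(), s.end());
    }
    return total;
}
```

With a stage of six 10 ms transforms followed by a 40 ms serial stage, dropping two of the transforms lowers `CpuTimeMs` from 100 to 80 but leaves `WallClockMs` at 50. Only shortening the serial chain moves the wall-clock number.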

---

## In-tree simple harnesses (k-averaged, no GMP)

`tests/multperf_simple.cpp` and `tests/divperf_simple.cpp` are the standalone
@@ -365,39 +397,39 @@ BZ's recursive halving pays off on the near-balanced `4096 × 2048` shape

| size (digits) | BigMath ms | GMP ms | BM/GMP |
|---|---:|---:|---:|
| 1 000 | 0.002 | 0.002 | 1.50× |
| 10 000 | 0.115 | 0.038 | 3.01× |
| 50 000 | 1.375 | 0.387 | 3.43× |
| 100 000 | 3.263 | 1.047 | 3.10× |
| 500 000 | 22.262 | 9.270 | 2.40× |
| 1 000 000 | 49.037 | 21.071 | 2.33× |
| 2 000 000 | 106.786 | 48.047 | 2.22× |
| 5 000 000 | 269.753 | 150.835 | 1.79× |
| 10 000 000 | 585.600 | 350.776 | 1.67× |
| 20 000 000 | 1 270.039 | 814.355 | 1.56× |
| 50 000 000 | 5 168.943 | 2 689.193 | 1.92× |

**Observation:** ratio narrows through the 10M-20M sweet spot (3.2× at 100k → **1.56× at 20M**) where BigMath's NTT overtakes GMP's basecase. It widens back to 1.92× at 50M as GMP's SSA activates.
| 1 000 | 0.002 | 0.002 | 1.51× |
| 10 000 | 0.085 | 0.038 | 2.24× |
| 50 000 | 1.182 | 0.397 | 2.98× |
| 100 000 | 2.867 | 1.051 | 2.73× |
| 500 000 | 19.369 | 8.869 | 2.18× |
| 1 000 000 | 43.339 | 20.552 | 2.11× |
| 2 000 000 | 95.944 | 47.272 | 2.03× |
| 5 000 000 | 253.822 | 148.577 | 1.71× |
| 10 000 000 | 552.055 | 345.782 | 1.60× |
| 20 000 000 | 1 223.549 | 813.621 | 1.50× |
| 50 000 000 | 5 236.669 | 2 691.178 | 1.95× |

**Observation (updated 2026-06-11):** the `DecimalDcThreshold` retune (8192 → 2048, PR #83) shaved 10–26% through the 10k–2M band; ratio narrows through the 10M-20M sweet spot (2.7× at 100k → **1.50× at 20M**) where BigMath's NTT overtakes GMP's basecase. It widens back to 1.95× at 50M as GMP's SSA activates.
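
To show what a divide-and-conquer parse threshold controls, here is a toy version on `uint64_t` (inputs of at most 19 digits). It is a sketch of the general technique only: the function names are illustrative, the threshold is counted in digits here, and the units of the real `DecimalDcThreshold` are not stated in this document:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string_view>

// 10^k; the caller keeps k small enough to fit in 64 bits.
uint64_t Pow10(std::size_t k) {
    uint64_t p = 1;
    while (k-- > 0) p *= 10;
    return p;
}

// Linear leaf: Horner's rule, one multiply-add per digit.
uint64_t ParseLinear(std::string_view s) {
    uint64_t v = 0;
    for (char c : s) {
        assert(c >= '0' && c <= '9');
        v = v * 10 + static_cast<uint64_t>(c - '0');
    }
    return v;
}

// Divide and conquer: value = hi * 10^len(lo) + lo.
// Inputs at or below the threshold fall through to the linear leaf.
uint64_t ParseDc(std::string_view s, std::size_t threshold) {
    if (s.size() <= threshold || s.size() < 2) return ParseLinear(s);
    std::size_t lo_len = s.size() / 2;
    std::string_view hi = s.substr(0, s.size() - lo_len);
    std::string_view lo = s.substr(s.size() - lo_len);
    return ParseDc(hi, threshold) * Pow10(lo_len) + ParseDc(lo, threshold);
}
```

The result is the same for any threshold; only the split between leaf work and recombination work changes. In a bignum setting the recombination multiply is a large multiplication, so a faster multiplier tends to push the best threshold lower.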

---

## ToString (BigInteger → string)

| size (digits) | BigMath ms | GMP ms | BM/GMP |
|---|---:|---:|---:|
| 1 000 | 0.007 | 0.004 | 1.82× |
| 10 000 | 0.274 | 0.078 | 3.50× |
| 50 000 | 4.044 | 0.861 | 4.70× |
| 100 000 | 19.803 | 2.357 | 8.40× |
| 200 000 | 40.760 | 6.417 | 6.35× |
| 500 000 | 106.486 | 20.667 | 5.15× |
| 1 000 000 | 224.755 | 50.174 | 4.48× |
| 2 000 000 | 481.526 | 119.342 | 4.03× |
| 5 000 000 | 1 104.781 | 386.146 | 2.86× |
| 10 000 000 | 2 416.172 | 913.741 | 2.64× |
| 20 000 000 | 5 432.991 | 2 155.828 | 2.52× |

**Observation:** narrowest gap at 1k (the linear leaf, where GM div2by1 already runs); peaks at 100k (D&C overhead + Newton recip setup not yet amortized); narrows again from 200k onward as D&C asymptotic + NTT inheriting from multiplication's overtake compound — **8.09× at 100k → 2.57× at 20M**.
| 1 000 | 0.006 | 0.003 | 1.82× |
| 10 000 | 0.261 | 0.078 | 3.35× |
| 50 000 | 3.916 | 0.860 | 4.56× |
| **100 000** | **17.608** | **2.388** | **7.37×** ← was 8.40× |
| **200 000** | **33.585** | **6.131** | **5.48×** ← was 6.35× |
| **500 000** | **84.371** | **20.667** | **4.08×** ← was 5.15× |
| **1 000 000** | **170.229** | **51.464** | **3.31×** ← was 4.48× |
| **2 000 000** | **352.228** | **121.325** | **2.90×** ← was 4.03× |
| **5 000 000** | **815.351** | **385.973** | **2.11×** ← was 2.86× |
| **10 000 000** | **1 732.746** | **925.173** | **1.87×** ← was 2.64× |
| **20 000 000** | **3 785.269** | **2 127.186** | **1.78×** ← was 2.52× |

**Observation (updated 2026-06-11):** ToString is divider-chain-bound, so it compounds the session's division wins (cyclic QB + top-limbs CR in `DivideChunk`, invertappr reciprocal in chain setup, `BIGMATH_TOSTR_DC_THRESHOLD` 2048 → 1024): 1M digits went 224.8 → 170.2 ms, 20M went 5 433 → 3 785 ms. Gap still peaks around 100k (D&C + chain setup not yet amortized) and narrows to **1.78× at 20M**. Methodology matches the original table: warm best-of-N below 100k, single cold run (chain build included) at 100k+.
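
The two timing modes named in the methodology note can be sketched with `std::chrono`. This is not the benchmark driver: the helper names are hypothetical, and treating "warm" as one untimed warm-up call before the timed runs is an assumption:

```cpp
#include <algorithm>
#include <chrono>
#include <limits>

// Single timed run; used as-is for the cold measurements.
template <typename F>
double TimeOnceMs(F&& work) {
    auto t0 = std::chrono::steady_clock::now();
    work();
    auto t1 = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(t1 - t0).count();
}

// Warm best-of-N: one untimed warm-up call, then the minimum of n timed runs.
template <typename F>
double BestOfNMs(F&& work, int n) {
    work();  // warm-up (assumed meaning of "warm"), not timed
    double best = std::numeric_limits<double>::infinity();
    for (int i = 0; i < n; ++i) {
        best = std::min(best, TimeOnceMs(work));
    }
    return best;
}
```

Taking the minimum filters out scheduler and cache noise on sub-millisecond rows, while a single cold run keeps one-time setup such as the chain build inside the measured time.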

### ToString focused warm benchmark
