mmurshed · mmurshed · May 30, 2026
diff --git a/BENCHMARK.md b/BENCHMARK.md
@@ -41,36 +41,36 @@ Balanced (`a.size() == b.size()`):
 | 50 000 × 50 000 | 0.698 | 0.398 | 1.76× |
 | 100 000 × 100 000 | 1.144 | 0.709 | 1.61× |
 | 500 000 × 500 000 | 5.120 | 4.322 | 1.18× |
-| 1 000 000 × 1 000 000 | 10.176 | 8.845 | 1.15× |
-| **2 000 000 × 2 000 000** | **21.570** | **20.994** | **1.03×** ← near parity |
-| **5 000 000 × 5 000 000** | **48.947** | **65.245** | **0.75×** ← BigMath faster |
-| **10 000 000 × 10 000 000** | **106.856** | **209.888** | **0.51×** ← BigMath 1.96× faster |
-| **20 000 000 × 20 000 000** | **275.252** | **278.298** | **0.99×** ← parity |
-| **50 000 000 × 50 000 000** | **1 248.374** | **674.945** | **1.85×** ← GMP faster |
-| **100 000 000 × 100 000 000** | **2 617.801** | **1 453.331** | **1.80×** ← GMP faster |
-| **200 000 000 × 200 000 000** | **6 578.614** | **3 049.051** | **2.16×** ← GMP faster |
+| 1 000 000 × 1 000 000 | 10.652 | 9.008 | 1.18× |
+| **2 000 000 × 2 000 000** | **21.264** | **20.501** | **1.04×** ← near parity |
+| **5 000 000 × 5 000 000** | **47.249** | **62.875** | **0.75×** ← BigMath faster |
+| **10 000 000 × 10 000 000** | **102.150** | **209.436** | **0.49×** ← BigMath 2.05× faster |
+| **20 000 000 × 20 000 000** | **306.081** | **272.170** | **1.12×** ← GMP faster |
+| **50 000 000 × 50 000 000** | **1 239.154** | **668.975** | **1.85×** ← GMP faster |
+| **100 000 000 × 100 000 000** | **2 818.366** | **1 480.842** | **1.90×** ← GMP faster |
+| **200 000 000 × 200 000 000** | **6 198.901** | **3 093.118** | **2.00×** ← GMP faster |
 
 Skewed (`a.size() >> b.size()`):
 
 | size | BigMath ms | GMP ms | BM/GMP |
 |---|---:|---:|---:|
-| 100 000 × 10 000 | 0.640 | 0.303 | 2.11× |
-| 500 000 × 50 000 | 2.168 | 2.135 | 1.02× |
-| **1 000 000 × 100 000** | **4.541** | **4.576** | **0.99×** ← BigMath faster |
-| **2 000 000 × 200 000** | **9.238** | **9.479** | **0.97×** ← BigMath faster |
-| 5 000 000 × 500 000 | 39.149 | 30.829 | 1.27× |
-| 10 000 000 × 1 000 000 | 90.079 | 70.716 | 1.27× |
-| 20 000 000 × 2 000 000 | 239.615 | 162.845 | 1.47× |
-| **50 000 000 × 5 000 000** | **615.801** | **684.121** | **0.90×** ← BigMath faster |
-| 100 000 000 × 10 000 000 | 1 159.229 | 1 011.270 | 1.15× |
-| 200 000 000 × 20 000 000 | 2 446.578 | 1 952.674 | 1.25× |
+| 100 000 × 10 000 | 0.543 | 0.305 | 1.78× |
+| 500 000 × 50 000 | 2.208 | 2.108 | 1.05× |
+| **1 000 000 × 100 000** | **4.301** | **4.518** | **0.95×** ← BigMath faster |
+| **2 000 000 × 200 000** | **9.367** | **9.360** | **1.00×** ← parity |
+| 5 000 000 × 500 000 | 38.681 | 30.543 | 1.27× |
+| 10 000 000 × 1 000 000 | 97.745 | 69.966 | 1.40× |
+| 20 000 000 × 2 000 000 | 243.961 | 160.780 | 1.52× |
+| **50 000 000 × 5 000 000** | **543.376** | **698.909** | **0.78×** ← BigMath faster |
+| 100 000 000 × 10 000 000 | 1 161.182 | 1 012.887 | 1.15× |
+| 200 000 000 × 20 000 000 | 2 505.058 | 1 980.750 | 1.26× |
 
 **Observations:**
 
-- **BigMath beats GMP on balanced multiplication across the 5M-10M band and is near parity at 20M.** Radix-4 + radix-8 fused NTT butterflies (PRs #59, #60) added 1.5-1.6× wall-clock vs prior. The 2026-05-27 MFA retune moved the gate from `2^21` to `2^24`, and the current 10M balanced row sits at **106.856 ms**.
+- **BigMath beats GMP on balanced multiplication across the 5M-10M band and is slightly behind at 20M (1.12×).** Radix-4 + radix-8 fused NTT butterflies (PRs #59, #60) delivered a 1.5-1.6× wall-clock speedup over the previous version. The MFA gate is still `2^24`, and the current 10M balanced row sits at **102.150 ms**.
 - **MFA / Bailey 6-step CRT NTT (PR #65) is now reserved for the very-large regime.** The default gate is `2^24` transform coefficients. Focused limb benchmarks show this avoids the 300k-2M limb regression band while preserving MFA wins at 3M+ limbs.
 - Below 500k, GMP's hand-tuned basecase keeps a 1.5-3.3× lead.
-- **Skewed mults: BigMath is around parity at 500k×50k and 1M×100k, with a slight BigMath lead at 2M×200k.** BigMath falls back behind GMP at 50M×5M and 100M×10M, and is 1.25× at the new 200M×20M row.
+- **Skewed mults: BigMath is within 5% of GMP from 500k×50k through 2M×200k (1.05×, 0.95×, 1.00×).** It trails by 1.27-1.52× from 5M×500k through 20M×2M, wins at 50M×5M (0.78×), then falls back behind GMP at 100M×10M (1.15×) and 200M×20M (1.26×).
 
 ### MFA focused threshold check
 
@@ -197,29 +197,29 @@ Balanced (`a.size() == b.size()`) — quotient is 1-2 limbs, both libraries shor
 | 50 000 × 50 000 | <0.001 | <0.001 | 0.43× |
 | 100 000 × 100 000 | <0.001 | 0.001 | 0.23× |
 | 500 000 × 500 000 | 0.001 | 0.005 | 0.18× |
-| 1 000 000 × 1 000 000 | 0.001 | 0.010 | 0.10× |
-| 5 000 000 × 5 000 000 | 1.367 | 0.189 | 7.23× |
+| 1 000 000 × 1 000 000 | 0.001 | 0.010 | 0.05× |
+| 5 000 000 × 5 000 000 | 1.222 | 0.177 | 6.90× |
 
 Skewed (`a.size() >> b.size()`) — Newton/BZ band, real algorithmic work:
 
 | size | BigMath ms | GMP ms | BM/GMP |
 |---|---:|---:|---:|
-| 40 000 × 10 000 | 0.732 | 0.218 | 3.36× |
-| 100 000 × 10 000 | 2.178 | 0.450 | 4.84× |
-| 200 000 × 50 000 | 11.088 | 1.704 | 6.51× |
-| 500 000 × 100 000 | 17.742 | 4.608 | 3.85× |
-| 1 000 000 × 200 000 | 33.632 | 10.022 | 3.36× |
-| 2 000 000 × 500 000 | 82.969 | 24.724 | 3.36× |
-| 5 000 000 × 1 000 000 | 200.641 | 69.900 | 2.87× |
-| 10 000 000 × 2 000 000 | 432.299 | 154.709 | 2.79× |
-| 20 000 000 × 4 000 000 | 982.906 | 353.134 | 2.78× |
-| 50 000 000 × 10 000 000 | 2 464.018 | 1 324.260 | 1.86× |
-| 100 000 000 × 20 000 000 | 6 119.817 | 2 594.778 | 2.36× |
+| 40 000 × 10 000 | 0.754 | 0.225 | 3.36× |
+| 100 000 × 10 000 | 2.314 | 0.462 | 5.01× |
+| 200 000 × 50 000 | 11.837 | 1.748 | 6.77× |
+| 500 000 × 100 000 | 17.375 | 4.553 | 3.82× |
+| 1 000 000 × 200 000 | 33.711 | 9.849 | 3.42× |
+| 2 000 000 × 500 000 | 83.035 | 24.422 | 3.40× |
+| 5 000 000 × 1 000 000 | 201.419 | 67.656 | 2.98× |
+| 10 000 000 × 2 000 000 | 424.143 | 152.982 | 2.77× |
+| 20 000 000 × 4 000 000 | 995.535 | 347.341 | 2.87× |
+| 50 000 000 × 10 000 000 | 2 482.322 | 1 299.545 | 1.91× |
+| 100 000 000 × 20 000 000 | 6 083.450 | 2 541.933 | 2.39× |
 | **200 000 000 × 40 000 000** | **12 979.547** | **4 550.205** | **2.85×** ← MFA flowing through Newton |
 
 **Observations:**
 
-- **Skewed division ratio narrows from ~6.5× at 200k×50k peak to 1.86× at 50M×10M, then rises back to 2.85× at 200M×40M.** Newton inherits the multiplication wins in the large regime, including MFA once the internal products cross the useful MFA band. The residual gap is still division-structure overhead, not decimal I/O.
+- **Skewed division ratio narrows from a ~6.8× peak at 200k×50k to 1.91× at 50M×10M, then rises back to 2.85× at 200M×40M.** Newton inherits the multiplication wins in the large regime, including MFA once the internal products cross the useful MFA band. The residual gap is still division-structure overhead, not decimal I/O.
 - 200k×50k stays the worst point: divisor sits below the Newton band (2596 limbs), goes through BZ which loses ~6.5× to GMP's `mpn_dcpi1_div_q` at this size.
 - Residual 2.8-3.9× gap in the 1M-20M skewed band is the structural cost of Newton's chunked iteration vs GMP's single recursive divide with precomputed inverse.
 - Balanced cases route through FastDivision short-circuits and aren't algorithmically meaningful at this size profile. The 5M×5M balanced case was regressing 27.03× before PR #56 fix (BZ misroute on degenerate quotient); now 7.20× via FastDivision short-circuit.
@@ -244,39 +244,39 @@ Representative Base2_64 results:
 
 | size (digits) | BigMath ms | GMP ms | BM/GMP |
 |---|---:|---:|---:|
-| 1 000 | 0.002 | 0.002 | 1.50× |
-| 10 000 | 0.115 | 0.038 | 3.01× |
-| 50 000 | 1.375 | 0.387 | 3.43× |
-| 100 000 | 3.263 | 1.047 | 3.10× |
-| 500 000 | 22.262 | 9.270 | 2.40× |
-| 1 000 000 | 49.037 | 21.071 | 2.33× |
-| 2 000 000 | 106.786 | 48.047 | 2.22× |
-| 5 000 000 | 269.753 | 150.835 | 1.79× |
-| 10 000 000 | 585.600 | 350.776 | 1.67× |
-| 20 000 000 | 1 270.039 | 814.355 | 1.56× |
-| 50 000 000 | 5 168.943 | 2 689.193 | 1.92× |
-
-**Observation:** ratio narrows through the 10M-20M sweet spot (3.2× at 100k → **1.56× at 20M**) where BigMath's NTT overtakes GMP's basecase. It widens back to 1.92× at 50M as GMP's SSA activates.
+| 1 000 | 0.002 | 0.002 | 1.48× |
+| 10 000 | 0.115 | 0.038 | 2.98× |
+| 50 000 | 1.377 | 0.401 | 3.44× |
+| 100 000 | 3.342 | 1.074 | 3.11× |
+| 500 000 | 22.149 | 8.819 | 2.51× |
+| 1 000 000 | 48.767 | 20.711 | 2.35× |
+| 2 000 000 | 106.201 | 47.759 | 2.22× |
+| 5 000 000 | 269.017 | 148.432 | 1.81× |
+| 10 000 000 | 583.973 | 349.941 | 1.67× |
+| 20 000 000 | 1 263.926 | 803.500 | 1.57× |
+| 50 000 000 | 5 372.849 | 2 664.778 | 2.02× |
+
+**Observation:** ratio narrows through the 10M-20M sweet spot (3.1× at 100k → **1.57× at 20M**) where BigMath's NTT overtakes GMP's basecase. It widens back to 2.02× at 50M as GMP's SSA activates.
 
 ---
 
 ## ToString (BigInteger → string)
 
 | size (digits) | BigMath ms | GMP ms | BM/GMP |
 |---|---:|---:|---:|
-| 1 000 | 0.007 | 0.004 | 1.82× |
-| 10 000 | 0.274 | 0.078 | 3.50× |
-| 50 000 | 4.044 | 0.861 | 4.70× |
-| 100 000 | 19.803 | 2.357 | 8.40× |
-| 200 000 | 40.760 | 6.417 | 6.35× |
-| 500 000 | 106.486 | 20.667 | 5.15× |
-| 1 000 000 | 224.755 | 50.174 | 4.48× |
-| 2 000 000 | 481.526 | 119.342 | 4.03× |
-| 5 000 000 | 1 104.781 | 386.146 | 2.86× |
-| 10 000 000 | 2 416.172 | 913.741 | 2.64× |
-| 20 000 000 | 5 432.991 | 2 155.828 | 2.52× |
-
-**Observation:** narrowest gap at 1k (the linear leaf, where GM div2by1 already runs); peaks at 100k (D&C overhead + Newton recip setup not yet amortized); narrows again from 200k onward as D&C asymptotic + NTT inheriting from multiplication's overtake compound — **8.09× at 100k → 2.57× at 20M**.
+| 1 000 | 0.002 | 0.002 | 1.48× |
+| 10 000 | 0.115 | 0.038 | 2.98× |
+| 50 000 | 1.377 | 0.401 | 3.44× |
+| 100 000 | 3.342 | 1.074 | 3.11× |
+| 500 000 | 22.149 | 8.819 | 2.51× |
+| 1 000 000 | 48.767 | 20.711 | 2.35× |
+| 2 000 000 | 106.201 | 47.759 | 2.22× |
+| 5 000 000 | 269.017 | 148.432 | 1.81× |
+| 10 000 000 | 583.973 | 349.941 | 1.67× |
+| 20 000 000 | 1 263.926 | 803.500 | 1.57× |
+| 50 000 000 | 5 372.849 | 2 664.778 | 2.02× |
+
+**Observation:** the gap is narrowest at 1k (the linear leaf, where GM div2by1 already runs), peaks at 100k (D&C overhead and Newton reciprocal setup are not yet amortized), and narrows again from 200k onward as the D&C asymptotics and the NTT gains inherited from multiplication compound: **8.31× at 100k → 2.59× at 20M**.
 
 ### ToString focused warm benchmark
 
@@ -310,17 +310,17 @@ Balanced operands (e.g. `100.10` = 100 integer + 10 fractional digits):
 | Add | 100.10 | 0.000 | 0.000 | 0.00× |
 | Add | 1000.100 | 0.000 | 0.000 | 0.00× |
 | Add | 5000.500 | 0.001 | 0.000 | 5.66× |
-| Add | 20000.2000 | 0.003 | 0.001 | 5.00× |
+| Add | 20000.2000 | 0.002 | 0.001 | 4.50× |
 |---|---|---|---|---|
-| Mul | 100.10 | 0.000 | 0.000 | 3.95× |
-| Mul | 1000.100 | 0.003 | 0.001 | 2.33× |
-| Mul | 5000.500 | 0.038 | 0.018 | 2.12× |
-| Mul | 20000.2000 | 0.345 | 0.091 | 3.78× |
+| Mul | 100.10 | 0.000 | 0.000 | 3.05× |
+| Mul | 1000.100 | 0.003 | 0.001 | 2.31× |
+| Mul | 5000.500 | 0.038 | 0.018 | 2.11× |
+| Mul | 20000.2000 | 0.351 | 0.092 | 3.82× |
 |---|---|---|---|---|
-| Div | 100.10 (10 dp) | 0.001 | 0.000 | 6.53× |
-| Div | 1000.100 (100 dp) | 0.003 | 0.002 | 1.41× |
-| Div | **5000.500 (500 dp)** | **0.023** | **0.026** | **0.89×** ← BigMath faster |
-| Div | 20000.2000 (2000 dp) | 0.233 | 0.198 | 1.18× |
+| Div | 100.10 (10 dp) | 0.001 | 0.000 | 8.33× |
+| Div | 1000.100 (100 dp) | 0.003 | 0.002 | 1.47× |
+| Div | **5000.500 (500 dp)** | **0.023** | **0.025** | **0.92×** ← BigMath faster |
+| Div | 20000.2000 (2000 dp) | 0.228 | 0.194 | 1.17× |
 
 Division at varying target scales (operand = 2000 integer + 200 fractional digits):
 
@@ -338,19 +338,19 @@ Parse (`string → BigDecimal`):
 
 | size (digits) | BigMath ms | GMP (mpf) ms | BM/GMP |
 |---|---:|---:|---:|
-| 100 | 0.001 | 0.001 | 1.00× |
-| 1 000 | 0.005 | 0.005 | 1.10× |
-| 10 000 | 0.138 | 0.107 | 1.29× |
-| 50 000 | 1.504 | 1.017 | 1.48× |
+| 100 | 0.001 | 0.000 | 1.09× |
+| 1 000 | 0.005 | 0.004 | 1.09× |
+| 10 000 | 0.134 | 0.105 | 1.28× |
+| 50 000 | 1.450 | 1.015 | 1.43× |
 
 ToString (`BigDecimal → string`):
 
 | size (digits) | BigMath ms | GMP (mpf) ms | BM/GMP |
 |---|---:|---:|---:|
-| 100 | 0.000 | 0.000 | 0.64× |
-| 1 000 | 0.006 | 0.004 | 1.41× |
-| 10 000 | 0.269 | 0.102 | 2.63× |
-| 50 000 | 4.003 | 1.094 | 3.66× |
+| 100 | 0.000 | 0.000 | 0.82× |
+| 1 000 | 0.006 | 0.005 | 1.39× |
+| 10 000 | 0.269 | 0.102 | 2.64× |
+| 50 000 | 3.976 | 1.101 | 3.61× |
 
 **Observations:**
 - BigDecimal division beats GMP at 5000.500 to 500 dp (0.023 vs 0.026 ms).
@@ -361,14 +361,14 @@ ToString (`BigDecimal → string`):
 
 ## Headline summary
 
-- **Multiplication 5M-20M balanced:** BigMath is faster at 5M and 10M, then slips back to parity at 20M. The current 10M balanced row is **106.856 ms vs 209.888 ms**, or 0.51×.
-- **Multiplication ≥50M balanced:** GMP still wins via SSA. At 200M the gap widens again to 2.16× on this snapshot.
-- **Multiplication skewed:** parity around 500k×50k and 1M×100k, with a slight BigMath lead at 2M×200k. BigMath falls behind at 50M×5M, 100M×10M, and 200M×20M.
-- **Division skewed:** 1.86×-6.51× behind GMP in the main band, with the new 200M×40M row at 2.85×. Worst remains 200k×50k (BZ band). The large-multiplication speedups still flow through Newton, but the structure overhead never disappears.
+- **Multiplication 5M-20M balanced:** BigMath is faster at 5M and 10M, then slips slightly behind GMP at 20M (1.12×). The current 10M balanced row is **102.150 ms vs 209.436 ms**, or 0.49×.
+- **Multiplication ≥50M balanced:** GMP still wins via SSA. At 200M the gap is 2.00× on this snapshot.
+- **Multiplication skewed:** within 5% of GMP from 500k×50k through 2M×200k. BigMath wins at 50M×5M (0.78×), but falls behind at 100M×10M (1.15×) and 200M×20M (1.26×).
+- **Division skewed:** 1.91×-6.77× behind GMP in the main band, with the 200M×40M row at 2.85×. Worst remains 200k×50k (BZ band). The large-multiplication speedups still flow through Newton, but the structure overhead never disappears.
 - **Division balanced 5M×5M:** PR #56 fix routes degenerate-quotient cases to FastDivision; 27.03× → 7.20×.
-- **Parse:** **1.56× at 20M** (best), widens to 1.92× at 50M as GMP's SSA path dominates.
-- **ToString:** 2.52× at 20M, narrowing from an 8.40× peak at 100k.
-- **BigDecimal division:** beats GMP at small target scales (0.20-1.00×) and at the 500 dp near-parity point (0.89×).
+- **Parse:** **1.57× at 20M** (best), widens to 2.02× at 50M as GMP's SSA path dominates.
+- **ToString:** 2.59× at 20M, narrowing from an 8.31× peak at 100k.
+- **BigDecimal division:** beats GMP at small target scales (0.20-0.87×) and at the 500 dp near-parity point (0.92×).
 
 For optimizations considered and rejected with measurement evidence, see the **Explored but rejected** sections of each subsystem doc. The 2026-05 optimization stack (LIMB_64 + CRT NTT + threading + M-G reciprocals + BZ Knuth fix + degenerate-quotient guard + radix-4/radix-8 fused butterflies + MFA) closed the GMP gap by 3-5× across every band up to the 20M crossover; past 20M, GMP's SSA still wins but the loss factor halved.
 

diff --git a/README.md b/README.md
@@ -117,26 +117,26 @@ Apple M1 Max, vs GMP 6.3.0, `-O3 -march=native`, full default stack (`BIGMATH_LI
 |---|---|---:|---:|---:|
 | mul | 100 000 × 100 000 | 1.35 ms | 0.78 ms | 1.72× |
 | mul | 1 000 000 × 1 000 000 | 10.5 ms | 9.06 ms | 1.16× |
-| mul | 2 000 000 × 2 000 000 | 21.9 ms | 20.6 ms | 1.07× |
-| mul | **5 000 000 × 5 000 000** | **46.5 ms** | **63.5 ms** | **0.73×** ← BigMath faster |
-| mul | **10 000 000 × 10 000 000** | **105 ms** | **212 ms** | **0.50×** ← BigMath 2.01× faster |
-| mul | 20 000 000 × 20 000 000 | 279 ms | 278 ms | 1.00× ← parity |
-| mul | 50 000 000 × 50 000 000 | 1 231 ms | 660 ms | 1.86× ← GMP SSA recovers |
-| mul | 100 000 000 × 100 000 000 | 2 832 ms | 1 391 ms | 2.04× |
-| mul (skewed) | **1 000 000 / 100 000** | **4.47 ms** | **4.79 ms** | **0.93×** ← BigMath faster |
-| mul (skewed) | 2 000 000 / 200 000 | 9.43 ms | 9.45 ms | 1.00× ← parity |
-| mul (skewed) | 10 000 000 / 1 000 000 | 88.4 ms | 70.5 ms | 1.25× |
-| mul (skewed) | 50 000 000 / 5 000 000 | 667 ms | 666 ms | 1.00× ← parity |
-| div (skewed) | 500 000 / 100 000 | 18.0 ms | 4.63 ms | 3.89× |
-| div (skewed) | 10 000 000 / 2 000 000 | 427 ms | 154 ms | 2.78× |
-| div (skewed) | 50 000 000 / 10 000 000 | 2 500 ms | 1 299 ms | **1.92×** |
-| parse | 1 000 000 digits | 48.7 ms | 20.3 ms | 2.40× |
-| parse | 20 000 000 digits | 1 252 ms | 803 ms | **1.56×** |
-| ToString | 100 000 digits | 19.5 ms | 2.33 ms | 8.35× |
-| ToString | 1 000 000 digits | 224 ms | 49.8 ms | 4.50× |
-| ToString | 20 000 000 digits | 5 437 ms | 2 116 ms | **2.57×** |
-
-**BigMath beats GMP on balanced multiplication across the 5M–10M digit band** and is roughly parity at 20M. The current peak is **10M balanced at 2.01× faster than GMP** (105 ms vs 212 ms). The MFA threshold retune improved that row by avoiding early MFA; at ≥50M GMP's Schönhage-Strassen still recovers. Skewed multiplication is a BigMath win around 1M×100k and parity at 2M×200k and 50M×5M, with the dispatcher now avoiding Toom-3 on 2:1+ skewed inputs in the pre-NTT band. Skewed division at 50M×10M is **1.92×** as Newton inherits the large-multiplication speedups. ToString narrows from 8.35× at 100k to 2.57× at 20M; parse to 1.56× at 20M. See [BENCHMARK.md](BENCHMARK.md) for the full table or the per-doc ratio tables for the breakdown.
+| mul | 2 000 000 × 2 000 000 | 21.3 ms | 20.5 ms | 1.04× |
+| mul | **5 000 000 × 5 000 000** | **47.2 ms** | **62.9 ms** | **0.75×** ← BigMath faster |
+| mul | **10 000 000 × 10 000 000** | **102 ms** | **209 ms** | **0.49×** ← BigMath 2.05× faster |
+| mul | 20 000 000 × 20 000 000 | 306 ms | 272 ms | 1.12× ← GMP faster |
+| mul | 50 000 000 × 50 000 000 | 1 239 ms | 669 ms | 1.85× ← GMP SSA recovers |
+| mul | 100 000 000 × 100 000 000 | 2 818 ms | 1 481 ms | 1.90× |
+| mul (skewed) | **1 000 000 / 100 000** | **4.30 ms** | **4.52 ms** | **0.95×** ← BigMath faster |
+| mul (skewed) | 2 000 000 / 200 000 | 9.37 ms | 9.36 ms | 1.00× ← parity |
+| mul (skewed) | 10 000 000 / 1 000 000 | 97.7 ms | 70.0 ms | 1.40× |
+| mul (skewed) | 50 000 000 / 5 000 000 | 543 ms | 699 ms | **0.78×** ← BigMath faster |
+| div (skewed) | 500 000 / 100 000 | 17.4 ms | 4.55 ms | 3.82× |
+| div (skewed) | 10 000 000 / 2 000 000 | 424 ms | 153 ms | 2.77× |
+| div (skewed) | 50 000 000 / 10 000 000 | 2 482 ms | 1 300 ms | **1.91×** |
+| parse | 1 000 000 digits | 49.0 ms | 20.7 ms | 2.36× |
+| parse | 20 000 000 digits | 1 264 ms | 804 ms | **1.57×** |
+| ToString | 100 000 digits | 20.0 ms | 2.40 ms | 8.31× |
+| ToString | 1 000 000 digits | 227 ms | 49.7 ms | 4.57× |
+| ToString | 20 000 000 digits | 5 529 ms | 2 133 ms | **2.59×** |
+
+**BigMath beats GMP on balanced multiplication across the 5M–10M digit band** and is slightly behind at 20M (1.12×). The current peak is **10M balanced at 2.05× faster than GMP** (102 ms vs 209 ms). The MFA threshold retune improved that row by avoiding early MFA; at ≥50M GMP's Schönhage-Strassen still recovers. Skewed multiplication is a BigMath win around 1M×100k, parity at 2M×200k, and a BigMath win again at 50M×5M (0.78×). Skewed division at 50M×10M is **1.91×** as Newton inherits the large-multiplication speedups. ToString narrows from 8.31× at 100k to 2.59× at 20M; parse narrows to 1.57× at 20M. See [BENCHMARK.md](BENCHMARK.md) for the full table or the per-doc ratio tables for the breakdown.
 
 Opt-out flags (`-DBIGMATH_USE_THREADS=0` / `-DBIGMATH_NTT_CRT=0` / `-DBIGMATH_LIMB_64=0`) revert any subset of the defaults — useful for embedded targets, header-only-strict consumers, or A/B comparison.
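
Because every table above carries a hand-maintained BM/GMP column, a small consistency check can catch stale ratios when the timings are refreshed. The following is a minimal sketch, not part of the repository: it assumes the ratio is simply BigMath ms divided by GMP ms, and the names `parse_row` and `find_mismatches` and the `0.02` tolerance are hypothetical choices made for illustration.

```python
import re


def _num(cell):
    """Convert a cell such as '1 239.154' or '21.3 ms' to a float."""
    # Remove digit-group whitespace (including non-breaking and thin spaces) and an 'ms' unit.
    return float(re.sub(r"\s|ms", "", cell))


def parse_row(line):
    """Return (label, bigmath_ms, gmp_ms, stated_ratio), or None for non-data rows."""
    # Drop a leading diff marker, bold markers, and any trailing annotation after the arrow.
    text = line.lstrip("+- ").replace("**", "").split("←")[0]
    cells = [c.strip() for c in text.strip().strip("|").split("|")]
    if len(cells) < 4:
        return None
    label, bm, gmp, ratio = cells[-4:]
    try:
        return label, _num(bm), _num(gmp), float(ratio.rstrip("×"))
    except ValueError:
        # Header rows, separator rows, and '<0.001' bounds land here.
        return None


def find_mismatches(lines, tol=0.02):
    """List the labels of rows whose stated ratio differs from BigMath/GMP by more than tol."""
    bad = []
    for line in lines:
        row = parse_row(line)
        if row is None:
            continue
        label, bm_ms, gmp_ms, stated = row
        if gmp_ms == 0:
            continue  # timings rounded to 0.000 carry no ratio information
        if abs(bm_ms / gmp_ms - stated) > tol:
            bad.append(label)
    return bad


if __name__ == "__main__":
    sample = [
        "+| **10 000 000 × 10 000 000** | **102.150** | **209.436** | **0.49×** ← BigMath 2.05× faster |",
        "+| **20 000 000 × 20 000 000** | **306.081** | **272.170** | **1.12×** ← GMP faster |",
    ]
    print(find_mismatches(sample))
```

The absolute tolerance only suits rows whose timings are well above the display precision. Sub-millisecond rows (for example the BigDecimal `Add` rows) are rounded to three decimals before the ratio could be recomputed, so they would need either the raw timings or a looser, relative tolerance.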