mmurshed · mmurshed · Jun 12, 2026 · Jun 12, 2026
diff --git a/BENCHMARK.md b/BENCHMARK.md
@@ -2,6 +2,8 @@
 
 Measured 2026-05-30 (fresh local rerun). Single-run snapshot capturing the state of `master` after the 2026-05 optimization pass (LIMB_64 default, multi-prime CRT NTT default, multithreaded NTT default, M-G div2by1 in `ClassicDivision`, M-G 3/2 qhat in `FastDivision` Base2_64, BZ Knuth-normalize fix, radix-4 + radix-8 fused NTT butterflies (PRs #59, #60), **Matrix Fourier Algorithm / Bailey 6-step CRT NTT retuned to n ≥ 2^24 coefficients**).
 
+**Division, parse, and ToString tables re-measured 2026-06-11** after the division/dispatch session (PRs #82–#89: FastDivision bit-shift normalization, decimal D&C threshold retune, the wraparound-Newton family — cyclic QB remainder, top-limbs CR, invertappr reciprocal — plus the 4/3 balanced band and quotient-sized division). Multiplication tables are unchanged 2026-05-30 values (multiplication code untouched by that session). See the [2026-06-11 session section](#2026-06-11-division--decimal-io-session-prs-82-89) for the dispatch-shape wins the standard grids don't cover.
+
 For per-subsystem deep dives — algorithms, dispatch, optimization history, rejected approaches — see:
 
 - [docs/MULTIPLICATION.md](docs/MULTIPLICATION.md)
@@ -217,38 +219,37 @@ Balanced (`a.size() == b.size()`) — quotient is 1-2 limbs, both libraries shor
 
 | size | BigMath ms | GMP ms | BM/GMP |
 |---|---:|---:|---:|
-| 1 000 × 1 000 | <0.001 | <0.001 | 10.90× |
-| 5 000 × 5 000 | 0.001 | <0.001 | 6.41× |
-| 10 000 × 10 000 | 0.002 | <0.001 | 6.55× |
-| 50 000 × 50 000 | <0.001 | <0.001 | 0.43× |
-| 100 000 × 100 000 | <0.001 | 0.001 | 0.23× |
-| 500 000 × 500 000 | 0.001 | 0.005 | 0.18× |
-| 1 000 000 × 1 000 000 | 0.001 | 0.010 | 0.10× |
-| 5 000 000 × 5 000 000 | 1.367 | 0.189 | 7.23× |
+| 1 000 × 1 000 | <0.001 | <0.001 | — |
+| 5 000 × 5 000 | 0.001 | <0.001 | — |
+| 10 000 × 10 000 | <0.001 | <0.001 | — |
+| 50 000 × 50 000 | 0.001 | <0.001 | 2.33× |
+| 100 000 × 100 000 | 0.001 | 0.001 | 2.31× |
+| 500 000 × 500 000 | 0.007 | 0.004 | 1.59× |
+| 1 000 000 × 1 000 000 | 0.014 | 0.009 | 1.62× |
+| 5 000 000 × 5 000 000 | 1.355 | 0.162 | 8.39× |
 
 Skewed (`a.size() >> b.size()`) — Newton/BZ band, real algorithmic work:
 
 | size | BigMath ms | GMP ms | BM/GMP |
 |---|---:|---:|---:|
-| 40 000 × 10 000 | 0.732 | 0.218 | 3.36× |
-| 100 000 × 10 000 | 2.178 | 0.450 | 4.84× |
-| 200 000 × 50 000 | 11.088 | 1.704 | 6.51× |
-| 500 000 × 100 000 | 17.742 | 4.608 | 3.85× |
-| 1 000 000 × 200 000 | 33.632 | 10.022 | 3.36× |
-| 2 000 000 × 500 000 | 82.969 | 24.724 | 3.36× |
-| 5 000 000 × 1 000 000 | 200.641 | 69.900 | 2.87× |
-| 10 000 000 × 2 000 000 | 432.299 | 154.709 | 2.79× |
-| 20 000 000 × 4 000 000 | 982.906 | 353.134 | 2.78× |
-| 50 000 000 × 10 000 000 | 2 464.018 | 1 324.260 | 1.86× |
-| 100 000 000 × 20 000 000 | 6 119.817 | 2 594.778 | 2.36× |
-| **200 000 000 × 40 000 000** | **12 979.547** | **4 550.205** | **2.85×** ← MFA flowing through Newton |
-
-**Observations:**
-
-- **Skewed division ratio narrows from ~6.5× at 200k×50k peak to 1.86× at 50M×10M, then rises back to 2.85× at 200M×40M.** Newton inherits the multiplication wins in the large regime, including MFA once the internal products cross the useful MFA band. The residual gap is still division-structure overhead, not decimal I/O.
+| 40 000 × 10 000 | 0.727 | 0.215 | 3.38× |
+| 100 000 × 10 000 | 2.184 | 0.460 | 4.75× |
+| 200 000 × 50 000 | 10.867 | 1.635 | 6.65× |
+| **500 000 × 100 000** | **12.442** | **4.469** | **2.78×** ← was 3.85× |
+| **1 000 000 × 200 000** | **22.599** | **9.723** | **2.32×** ← was 3.36× |
+| **2 000 000 × 500 000** | **45.743** | **24.010** | **1.91×** ← was 3.36× |
+| **5 000 000 × 1 000 000** | **111.407** | **70.057** | **1.59×** ← was 2.87× |
+| **10 000 000 × 2 000 000** | **232.324** | **150.828** | **1.54×** ← was 2.79× |
+| **20 000 000 × 4 000 000** | **519.977** | **344.471** | **1.51×** ← was 2.78× |
+| **50 000 000 × 10 000 000** | **1 461.269** | **1 297.461** | **1.13×** ← was 1.86× |
+| 100 000 000 × 20 000 000 | 6 119.817 | 2 594.778 | 2.36× (2026-05-30, not re-measured) |
+| 200 000 000 × 40 000 000 | 12 979.547 | 4 550.205 | 2.85× (2026-05-30, not re-measured) |
+
+**Observations (updated 2026-06-11):**
+
+- **The wraparound-Newton family (PRs #85–#87) cut the skewed band roughly in half: 500k×100k through 50M×10M now sit at 1.13–2.78× vs GMP, from 1.86–3.85× before.** 5M×1M went 200.6 → 111.4 ms; 50M×10M is at **1.13×, near parity**. The cuts shorten the serial transform chain (cyclic mod-B^L−1 products at half length, top-limbs quotient estimates), which is the lever that converts to wall-clock on the threaded stack.
 - 200k×50k stays the worst point: divisor sits below the Newton band (2596 limbs), goes through BZ which loses ~6.5× to GMP's `mpn_dcpi1_div_q` at this size.
-- Residual 2.8-3.9× gap in the 1M-20M skewed band is the structural cost of Newton's chunked iteration vs GMP's single recursive divide with precomputed inverse.
-- Balanced cases route through FastDivision short-circuits and aren't algorithmically meaningful at this size profile. The 5M×5M balanced case was regressing 27.03× before PR #56 fix (BZ misroute on degenerate quotient); now 7.20× via FastDivision short-circuit.
+- Balanced equal-size rows are degenerate (quotient 0–1 limbs, both libraries short-circuit; sub-15µs absolute through 1M digits). FastDivision's scalar-normalization fix (PR #82, bit-shift normalize, 3.5× on that path) shows up in driver benchmarks rather than these noise-level rows.
 
 ### Shape-focused division dispatch
 
@@ -311,6 +312,37 @@ limb-for-limb across `na ∈ {2n-1, 2n, 2n+1, 2n+2, 3n}` × `nb` near the bounda
 
 ---
 
+## 2026-06-11 division & decimal-IO session (PRs #82-#89)
+
+One-day profiling-driven pass over division and decimal I/O. Full analysis and rejection notes in
+[docs/DIVISION.md](docs/DIVISION.md) and [docs/STRING_CONVERSION.md](docs/STRING_CONVERSION.md);
+this is the headline summary. The refreshed division/parse/ToString tables above are the post-session state.
+
+| PR | change | headline win |
+|---|---|---|
+| #82 | FastDivision bit-shift normalization (replaces scalar-d normalize; kills per-limb `__udivmodti4` remainder denorm) | balanced div 1M digits 0.89 → 0.25 ms (3.5×) |
+| #83 | decimal D&C thresholds: parse 8192 → 2048, tostr 2048 → 1024 (May sweep inverted by the radix-8/MFA/Newton changes) | parse −12…−53% (4k–1M digits), tostr −19…−33% (1.5k–10k) |
+| #84 | doc-only: prepared-transform reuse in `DivideChunk` REJECTED — cut CPU 11% but wall-clock flat (the 6 forward transforms already run concurrently; only serial-chain cuts convert to wall-clock on the threaded stack) | — |
+| #85 | cyclic wrap-around remainder: `Q·b_norm` replaced by its residue mod `B^L−1` at half transform length | skewed div −10–11%, tostr −7% |
+| #86 | top-limbs quotient estimate (GMP `mu_divappr` style): multiply only top n+1 chunk limbs against R | skewed div −20% at transform-boundary sizes, tostr −11% |
+| #87 | invertappr-style wrapped Newton iteration in `ApproxReciprocal` (exact E from cyclic residue + top-slice correction) | one-shot skewed div −24–25%, `Divider` setup −29% |
+| #88 | Newton balanced band ratio 2/1 → 4/3 (post-#85–87 crossover re-measured) | nb=131073 limbs ratio 1.5: 10.7 s → 157 ms |
+| #89 | **quotient-sized division** (`QuotientSizedDivision.h`): divides operand tops for short quotients; cost scales with quotient, not divisor; balanced-band floor 98304 → 24576 limbs | 2^k+1 pathology ratios 1.05–1.25: 1.07–5.35 s → 28–71 ms (38–75×) |
+
+Cumulative on the standard grid: skewed div 5M×1M 200.6 → 111.4 ms (2.87× → 1.59× vs GMP),
+50M×10M at **1.13× — near parity**; tostr 1M 224.8 → 170.2 ms; parse 1M 49.0 → 43.3 ms.
+
+Division dispatch now has no known residual shape holes: ratio ∈ (1, 4/3) at b ≥ 24576 limbs routes to
+quotient-sized division (which also beats BZ at BZ's exact-power-of-2 best case), 4/3 ≤ ratio bands route
+to Newton, small/degenerate shapes keep FastDivision.
+
+Two recorded process lessons: profile-sample arithmetic overcounts parallel-overlapped work (CPU-time
+savings ≠ wall-clock savings — see PR #84's rejection), and the exact Newton tower drifts up to ~B^6 ulps,
+so any residue-window sizing needs ~B^16 headroom (a B^4 window caused an 8× ToString regression via
+silent FastDivision fallbacks before being caught with a `10^500000`-divisor repro).
+
+---
+
 ## In-tree simple harnesses (k-averaged, no GMP)
 
 `tests/multperf_simple.cpp` and `tests/divperf_simple.cpp` are the standalone
@@ -365,39 +397,39 @@ BZ's recursive halving pays off on the near-balanced `4096 × 2048` shape
 
 | size (digits) | BigMath ms | GMP ms | BM/GMP |
 |---|---:|---:|---:|
-| 1 000 | 0.002 | 0.002 | 1.50× |
-| 10 000 | 0.115 | 0.038 | 3.01× |
-| 50 000 | 1.375 | 0.387 | 3.43× |
-| 100 000 | 3.263 | 1.047 | 3.10× |
-| 500 000 | 22.262 | 9.270 | 2.40× |
-| 1 000 000 | 49.037 | 21.071 | 2.33× |
-| 2 000 000 | 106.786 | 48.047 | 2.22× |
-| 5 000 000 | 269.753 | 150.835 | 1.79× |
-| 10 000 000 | 585.600 | 350.776 | 1.67× |
-| 20 000 000 | 1 270.039 | 814.355 | 1.56× |
-| 50 000 000 | 5 168.943 | 2 689.193 | 1.92× |
-
-**Observation:** ratio narrows through the 10M-20M sweet spot (3.2× at 100k → **1.56× at 20M**) where BigMath's NTT overtakes GMP's basecase. It widens back to 1.92× at 50M as GMP's SSA activates.
+| 1 000 | 0.002 | 0.002 | 1.51× |
+| 10 000 | 0.085 | 0.038 | 2.24× |
+| 50 000 | 1.182 | 0.397 | 2.98× |
+| 100 000 | 2.867 | 1.051 | 2.73× |
+| 500 000 | 19.369 | 8.869 | 2.18× |
+| 1 000 000 | 43.339 | 20.552 | 2.11× |
+| 2 000 000 | 95.944 | 47.272 | 2.03× |
+| 5 000 000 | 253.822 | 148.577 | 1.71× |
+| 10 000 000 | 552.055 | 345.782 | 1.60× |
+| 20 000 000 | 1 223.549 | 813.621 | 1.50× |
+| 50 000 000 | 5 236.669 | 2 691.178 | 1.95× |
+
+**Observation (updated 2026-06-11):** the `DecimalDcThreshold` retune (8192 → 2048, PR #83) shaved 10–26% through the 10k–2M band; ratio narrows through the 10M-20M sweet spot (2.7× at 100k → **1.50× at 20M**) where BigMath's NTT overtakes GMP's basecase. It widens back to 1.95× at 50M as GMP's SSA activates.
 
 ---
 
 ## ToString (BigInteger → string)
 
 | size (digits) | BigMath ms | GMP ms | BM/GMP |
 |---|---:|---:|---:|
-| 1 000 | 0.007 | 0.004 | 1.82× |
-| 10 000 | 0.274 | 0.078 | 3.50× |
-| 50 000 | 4.044 | 0.861 | 4.70× |
-| 100 000 | 19.803 | 2.357 | 8.40× |
-| 200 000 | 40.760 | 6.417 | 6.35× |
-| 500 000 | 106.486 | 20.667 | 5.15× |
-| 1 000 000 | 224.755 | 50.174 | 4.48× |
-| 2 000 000 | 481.526 | 119.342 | 4.03× |
-| 5 000 000 | 1 104.781 | 386.146 | 2.86× |
-| 10 000 000 | 2 416.172 | 913.741 | 2.64× |
-| 20 000 000 | 5 432.991 | 2 155.828 | 2.52× |
-
-**Observation:** narrowest gap at 1k (the linear leaf, where GM div2by1 already runs); peaks at 100k (D&C overhead + Newton recip setup not yet amortized); narrows again from 200k onward as D&C asymptotic + NTT inheriting from multiplication's overtake compound — **8.09× at 100k → 2.57× at 20M**.
+| 1 000 | 0.006 | 0.003 | 1.82× |
+| 10 000 | 0.261 | 0.078 | 3.35× |
+| 50 000 | 3.916 | 0.860 | 4.56× |
+| **100 000** | **17.608** | **2.388** | **7.37×** ← was 8.40× |
+| **200 000** | **33.585** | **6.131** | **5.48×** ← was 6.35× |
+| **500 000** | **84.371** | **20.667** | **4.08×** ← was 5.15× |
+| **1 000 000** | **170.229** | **51.464** | **3.31×** ← was 4.48× |
+| **2 000 000** | **352.228** | **121.325** | **2.90×** ← was 4.03× |
+| **5 000 000** | **815.351** | **385.973** | **2.11×** ← was 2.86× |
+| **10 000 000** | **1 732.746** | **925.173** | **1.87×** ← was 2.64× |
+| **20 000 000** | **3 785.269** | **2 127.186** | **1.78×** ← was 2.52× |
+
+**Observation (updated 2026-06-11):** ToString is divider-chain-bound, so it compounds the session's division wins (cyclic QB + top-limbs CR in `DivideChunk`, invertappr reciprocal in chain setup, `BIGMATH_TOSTR_DC_THRESHOLD` 2048 → 1024): 1M digits went 224.8 → 170.2 ms, 20M went 5 433 → 3 785 ms. Gap still peaks around 100k (D&C + chain setup not yet amortized) and narrows to **1.78× at 20M**. Methodology matches the original table: warm best-of-N below 100k, single cold run (chain build included) at 100k+.
 
 ### ToString focused warm benchmark