mmurshed · mmurshed · May 31, 2026 · May 31, 2026
diff --git a/BENCHMARK.md b/BENCHMARK.md
@@ -437,6 +437,7 @@ ToString (`BigDecimal → string`):
 - **Multiplication skewed:** parity around 500k×50k and 1M×100k, with a slight BigMath lead at 2M×200k. BigMath falls behind at 50M×5M, 100M×10M, and 200M×20M.
 - **Division skewed:** 1.86×-6.51× behind GMP in the main band, with the new 200M×40M row at 2.85×. Worst remains 200k×50k (BZ band). The large-multiplication speedups still flow through Newton, but the structure overhead never disappears.
 - **Division balanced 5M×5M:** PR #56 fix routes degenerate-quotient cases to FastDivision; 27.03× → 7.20×.
+- **Division near-balanced (ratio ≈ 2, large `b`):** PR #79 added a Newton balanced band (`b ≥ 98304, a ≥ 2b`) that dodges BZ's 5–60× non-power-of-2 blowup — 8.8× at 500k/250k, 97.7× at the 2¹⁸+1 worst case. Division is now NTT-bound for ratio ≥ 2 and ratio ≥ 3. **Residual:** ratio ∈ (1, 2) at large `b` still on BZ (~2.7× slower than Newton at ratio 1.5).
 - **Parse:** **1.56× at 20M** (best), widens to 1.92× at 50M as GMP's SSA path dominates.
 - **ToString:** 2.52× at 20M, narrowing from an 8.40× peak at 100k.
 - **BigDecimal division:** beats GMP at small target scales (0.20-1.00×) and at the 500 dp near-parity point (0.89×).
@@ -460,6 +461,7 @@ PR #65 added MFA / Bailey 6-step routing inside `NTTMultiplicationCrt`. The defa
 | `BZ_DIVISOR_THRESHOLD` | 512 | 768 | Keep — current 512 already correct (BZ activates at b > 512; the 1024/512 tie is FastDiv noise) |
 | `NEWTON_MEDIUM_B` | 4096 | 8192 | Keep — current dispatch uses NEWTON_MEDIUM_B together with `a ≥ 3b`, all measured 8192-divisor cases at ratio ≥ 4 still win Newton |
 | `NEWTON_SKEW_NUMERATOR/DENOMINATOR` | 3/1 | 4/1 | Keep — bench rows from 5M×1M onward confirm Newton wins at ratio 5 (BigMath path) |
+| `NEWTON_BALANCED_B` / ratio | 98304 / 2/1 | — | Added PR #79 — near-balanced band; below it BZ wins, above it BZ blows up 5–60× on non-pow2 sizes |
 | `NEWTON_HIGH_SKEW_*` | 2048 / 8/1 | (n/a, no rows) | Keep |
 
 **Toom-3 added to dispatch (2026-05-27, skew-gated 2026-05-30):** focused band scan (above, **Toom-3 dispatch band**) found Toom-3 wins by 4-10% in total ∈ [3 584, 4 096] and avoids a 33% NTT-length boundary regression at total 4 608. Dispatch slots Toom-3 between Karatsuba and NTT for total ∈ [2 560, 5 120), but only for non-skewed operands (`max < 2*min`). A skew-focused scan found 2:1+ shapes in that same window are faster through Karatsuba, with wins such as `2560x256` improving from 0.730 ms → 0.343 ms through `Multiply()`. `NTT_MULTIPLICATION_THRESHOLD` remains 5 120. Toom-5 still has no productive band — ties Karatsuba in 64-256 per-operand limbs and degrades sharply above 256, so stays excluded.
diff --git a/CLAUDE.md b/CLAUDE.md
@@ -40,13 +40,18 @@ CI (`.github/workflows/qa.yaml`) runs `nanoclaw-task qa-agent` on PRs against `o
 
 **Why no Schönhage-Strassen.** NTT here uses the Goldilocks prime `P = 2^64 - 2^32 + 1` with 16-bit input split. Convolution-accumulation overflow only at ~2^31 limbs (≈ 8 GB operands), so NTT covers every practical size. SSA's asymptotic edge is `log log n`; at any size below ~2^40 limbs the constant-factor cost of mod-2^N+1 arithmetic loses to NTT. The ~700+ lines of SSA aren't justified for this codebase — if you need more big-int wins, look at block-recursive Mulders division or Schönhage's subquadratic GCD instead.
 
-**Division dispatch** (`algorithms/Division.h`):
-- big-and-skewed (`a.size() > 2048 && a.size() > 3*b.size()`) → `BurnikelZieglerDivision`
-- Newton band → `NewtonDivision` — Newton-Raphson reciprocal, O(M(n)). Handles arbitrary `na/nb` via internal blockwise mode (top chunk in [n+1, 2n], slide down by n, thread the remainder). Two dispatch bands: `b ≥ 24576` at ratio ≥ 4/3, or `b ≥ 16384` at ratio ≥ 2. Wins range from 4× (ratio 2) to 20× (ratio 4 at large sizes) vs FD/BZ.
+**Division dispatch** (`algorithms/Division.h`, thresholds defined in `src/algorithms/Division.cpp`):
+- `NewtonDivision` (Newton-Raphson reciprocal, O(M(n)); handles arbitrary `na/nb` via blockwise mode — top chunk in [n+1, 2n], slide down by n, thread the remainder) when any skew band holds:
+  - `b ≥ 4096` at ratio ≥ 3 (`NEWTON_SKEW` 3/1), or
+  - `b ≥ 98304` at ratio ≥ 2 (`NEWTON_BALANCED` 2/1 — the near-balanced band, PR #79), or
+  - `b ≥ 2048` at ratio ≥ 8 (`NEWTON_HIGH_SKEW` 8/1).
+- else `BurnikelZieglerDivision` for power-of-two base when `b > 512` and the BZ band fits (near-balanced `b ≥ 1024, b+32 ≤ a ≤ 3b`, or big-and-skewed `a > 2048 && a > 3b`).
 - otherwise multi-limb → `FastDivision` (Knuth Algorithm D variant)
 - single-limb divisor → `ClassicDivision`
 - `KnuthDivision` and `ReciprocalDivision` are alternates used by correctness tests for cross-checking.
 
+The balanced band exists because BZ's recursive 2n/n halving lands intermediate NTT multiplies just over power-of-2 transform-length boundaries for non-power-of-2 divisor sizes, blowing up 5–60× vs Newton (worst at `n = 2^k+1`); Newton pads once and stays flat. **Known residual:** ratio ∈ (1, 2) at large `b` still routes to BZ and hits the same blowup (~2.7× slower than Newton would be at ratio 1.5) — the balanced band's `a ≥ 2b` lower bound doesn't cover it yet.
+
 When adding a new algorithm, slot the implementation under `algorithms/<op>/<Name>.h`, then update the dispatch in `algorithms/<Op>.h` — the thresholds there are the only place size cutoffs live.
 
 **Tests cross-check by construction.** Correctness tests (`mult_correctness.cpp`, `div_correctness.cpp`) run every algorithm against classic/BurnikelZiegler and compare results limb-for-limb via `Compare(...)`; the division test also verifies the identity `q*b + r == a` and `r < b`. When changing an algorithm, run these — passing tests means the new code agrees with the reference path, not just that it ran.

diff --git a/README.md b/README.md
@@ -13,7 +13,7 @@ Target: pure C++ that reaches parity or better than GMP in selected multiplicati
 - **Radix-4 + radix-8 fused NTT butterflies** (PRs #59, #60). Adjacent radix-2 layers collapse into single load/store butterflies — radix-4 fuses 2 layers (4 elements), radix-8 fuses 3 layers (8 elements). Same modular op count; 3× fewer memory passes vs radix-2. ~1.6× wall-clock at ≥2M limbs.
 - **MFA / Bailey 6-step CRT NTT** (`BIGMATH_NTT_MFA_THRESHOLD=2^24`, default). Very large CRT transforms switch to a cache-friendly matrix Fourier layout. The threshold was retuned upward from `2^21` to avoid regressions in the 300k-2M limb band while keeping wins at `2^24+` transform sizes.
 - **Multithreaded NTT** (`BIGMATH_USE_THREADS=1`, default). Small thread pool (size `min(hw_concurrency, BIGMATH_MAX_THREADS=8)`) parallelizes the CRT path: 6 forwards + 3 inverses as batched work units, one `ParallelDo` dispatch per phase. 2.3-3.4× speedup on large mul / skewed div / parse. Opt out via `-DBIGMATH_USE_THREADS=0` to drop pthread linkage.
-- **Division:** Classic short division → Knuth Algorithm D (`FastDivision` with Möller-Granlund 3-by-2 qhat for Base2_32) → Burnikel–Ziegler (balanced large) → Newton–Raphson with reciprocal caching (skewed large). Identity `q·b + r == a` is cross-checked in `tests/div_correctness.cpp`.
+- **Division:** Classic short division → Knuth Algorithm D (`FastDivision` with Möller-Granlund 3-by-2 qhat for Base2_32) → Burnikel–Ziegler (mid-size near-balanced) → Newton–Raphson with reciprocal caching (skewed large, plus near-balanced ratio ≥ 2 above 98304 limbs where BZ blows up 5–60× on non-power-of-2 divisor sizes — PR #79). Identity `q·b + r == a` is cross-checked in `tests/div_correctness.cpp`.
 - **Squaring:** Specialized Classic / Karatsuba / NTT squarers (1.4–1.6× over `Multiply(a,a)`).
 - **String I/O:** Linear chunked parser/formatter for small inputs, divide-and-conquer with cached Newton reciprocals at scale. Asymptotic `O(M(L) · log L)` both directions.
 - **BigDecimal:** Java-style fixed-point decimal (unscaled BigInteger + int scale) with exact +, −, \*; rounded division taking 8 rounding modes; parse/format covering plain and scientific notation.

diff --git a/division_improvement_plan.md b/division_improvement_plan.md
@@ -57,6 +57,20 @@ Finding: Newton was entering too early for 1024-2048 limb divisors. The retune u
 - high-skew Newton: `b >= 2048` and `a >= 8b`
 - BZ: enabled for both `Base2_32` and `Base2_64`
 
+## 3b. Near-Balanced Newton Band + BZ Non-Power-of-2 Blowup (PR #79)
+
+Status: completed.
+
+Profiling near-balanced (ratio ≈ 2) division at large `b` found two compounding bugs:
+
+1. **Newton single-block pathology.** The single-block path (`na ≤ 2n+1`) ran a `2n+1`-limb chunk through the truncated `(chunk·R) >> 2n` estimate; the error scales with `chunk / B^(2n)` and reaches ~`B` once the chunk exceeds `2n` limbs (the `+1` comes from the Knuth normalize shift on `a ≈ 2n`), overflowing the fixup cap and bailing to quadratic FastDivision (23× spike at `nb = 50000`). Fixed by routing `na > 2n` through the blockwise path.
+
+2. **BZ non-power-of-2 blowup.** BZ's recursive 2n/n halving lands its intermediate NTT multiplies just over power-of-2 transform-length boundaries for non-power-of-2 divisor sizes; the FFT length doubles and the constant factor compounds across recursion depth into a **5–60× slowdown vs Newton**, worst at `n = 2^k+1` (≈90 s for a 262145-limb divisor vs Newton ~0.9 s).
+
+Fix: new Newton balanced band — `b >= NEWTON_BALANCED_B (98304)` and `a >= 2b` → Newton. Measured 8.8× at 500k/250k, 5.6× at 2M/1M, 97.7× at the 2¹⁸+1 worst case; exact-pow2 sizes (BZ best case) regress ~4%. Harness: `tests/performance/division_balanced_bench.cpp`.
+
+**Residual / next step:** ratio ∈ (1, 2) at large `b` still routes to BZ and hits the same blowup (~2.7× slower than Newton at ratio 1.5). Extending the balanced band below ratio 2 needs a quotient-bulk lower bound so genuinely tiny-quotient `a ≈ b` cases (where a full reciprocal is wasteful) stay on FastDivision.
+
 ## 4. Prototype GMP-Style Pre-Inverted D&C Division
 
 Status: prototyped and rejected for now.

diff --git a/docs/DIVISION.md b/docs/DIVISION.md
@@ -62,7 +62,7 @@ flowchart TD
     C -- no --> D{Compare&#40;a, b&#41;}
     D -- a == b --> Eq[return &#123;1, 0&#125;]
     D -- a &lt; b --> Less[return &#123;0, a&#125;]
-    D -- a &gt; b --> E{Newton band?<br/>b ≥ 4096 and a ≥ 3b<br/>OR b ≥ 2048 and a ≥ 8b}
+    D -- a &gt; b --> E{Newton band?<br/>b ≥ 4096 and a ≥ 3b<br/>OR b ≥ 98304 and a ≥ 2b<br/>OR b ≥ 2048 and a ≥ 8b}
     E -- yes --> N[NewtonDivision]
     E -- no --> F{Power-of-two base<br/>AND b.size &gt; 512<br/>AND BZ shape fits?}
     F -- yes --> BZ[BurnikelZieglerDivision]
@@ -81,6 +81,9 @@ Default thresholds (overridable via `-D...`):
 | `BIGMATH_NEWTON_MEDIUM_B` | `4096` | limbs | Newton lower bound for medium-skew division |
 | `BIGMATH_NEWTON_SKEW_NUMERATOR` | `3` | — | medium-skew Newton requires `a.size() ≥ 3·b.size() / 1` |
 | `BIGMATH_NEWTON_SKEW_DENOMINATOR` | `1` | — | |
+| `BIGMATH_NEWTON_BALANCED_B` | `98304` | limbs | Newton lower bound for the near-balanced band (PR #79) |
+| `BIGMATH_NEWTON_BALANCED_NUMERATOR` | `2` | — | balanced Newton requires `a.size() ≥ 2·b.size() / 1` |
+| `BIGMATH_NEWTON_BALANCED_DENOMINATOR` | `1` | — | |
 | `BIGMATH_NEWTON_HIGH_SKEW_B` | `2048` | limbs | lower bound for very-high-skew Newton |
 | `BIGMATH_NEWTON_HIGH_SKEW_NUMERATOR` | `8` | — | high-skew Newton requires `a.size() ≥ 8·b.size() / 1` |
 | `BIGMATH_NEWTON_HIGH_SKEW_DENOMINATOR` | `1` | — | |
@@ -90,16 +93,19 @@ Default thresholds (overridable via `-D...`):
 The current dispatch logic, paraphrased:
 
 ```
-1. (b.size ≥ 4096 AND a.size ≥ 3·b.size)
-   OR (b.size ≥ 2048 AND a.size ≥ 8·b.size)                               → Newton
+1. (b.size ≥ 4096   AND a.size ≥ 3·b.size)
+   OR (b.size ≥ 98304 AND a.size ≥ 2·b.size)   ← near-balanced band (PR #79)
+   OR (b.size ≥ 2048  AND a.size ≥ 8·b.size)                              → Newton
 2. (Base2_32 OR Base2_64)  AND  b.size > 512  AND
    ( (b.size ≥ 1024 AND a.size ≥ b.size + 32 AND a.size ≤ 3·b.size)
      OR (a.size > 2048 AND a.size > 3·b.size) )                           → Burnikel–Ziegler
 3. else                                                                   → FastDivision
 4. (b.size == 1 inside FastDivision)                                      → ClassicDivision
 ```
 
-The ordering matters: Newton wins on **large skewed** problems because the per-divisor reciprocal setup amortizes over multiple chunks. BZ wins on **near-balanced and mid-size skewed** problems where its 2n/n recursion structure beats both FastDivision and Newton's setup cost. FastDivision is the default workhorse for everything else.
+The ordering matters: Newton wins on **large skewed** problems because the per-divisor reciprocal setup amortizes over multiple chunks. BZ wins on **mid-size near-balanced** problems where its 2n/n recursion structure beats both FastDivision and Newton's setup cost.
+
+**Near-balanced band (PR #79).** Above `NEWTON_BALANCED_B` (98304 limbs), ratio-≥2 division goes to Newton instead of BZ. BZ's recursive 2n/n halving lands its intermediate NTT multiplies just over power-of-2 transform-length boundaries for non-power-of-2 divisor sizes — the FFT length doubles and the constant factor compounds across recursion depth into a **5–60× slowdown vs Newton**, worst at `n = 2^k + 1` (measured ~90 s for a 262145-limb divisor vs Newton's ~0.9 s). Newton pads once to the working size and stays flat. Exact-power-of-2 divisor sizes are BZ's best case (it ties Newton); they regress ~4 % under this band but are rare in practice. **Known residual:** ratio ∈ (1, 2) at large `b` still routes to BZ and hits the same blowup (~2.7× slower than Newton at ratio 1.5); the band's `a ≥ 2b` lower bound does not cover it. FastDivision is the default workhorse for everything else.
 
 `KnuthDivision` and `ReciprocalDivision` exist as alternate implementations used by correctness tests for cross-checking. They are not in the production dispatch path.
 
@@ -245,7 +251,7 @@ Once `R` is computed, a single division `a / D` becomes:
 
 The implementation calls this `DivideChunk`.
 
-**Blockwise mode for large dividends.** When `na > 2n + 1` (where `n = |b|`), the algorithm processes the dividend in chunks:
+**Blockwise mode for large dividends.** When `na > 2n` (where `n = |b|`), the algorithm processes the dividend in chunks:
 
 ```
 
@@ -265,9 +271,9 @@ Each chunk costs O(M(n)) and there are O(na / n) chunks, so total cost is O((na
 1. The reciprocal `R` is computed once and reused across all chunks.
 2. Each chunk's `DivideChunk` is O(M(n)) thanks to NTT/Karatsuba in the high-half multiplication, not O(n · chunk_size).
 
-**High-precision reciprocal flag.** When `na ≥ 2n`, the implementation runs one **extra Newton iteration at full precision** after the main loop (`EXTRA_REFINE_ITERS = 1`). This is necessary because integer rounding in the standard iteration leaves `R` accurate to only ~half-bits at large `n`, which combined with a dividend window of size ≥ 2n produces a `Q_estimate` error of `O(n)` — too large for the `FIXUP_LIMIT = 8` correction loop to absorb. The extra refinement iteration drops the error to ≤ 1 quotient digit, which the fixup loop handles trivially.
+**Single-block boundary (PR #79).** The single-block path handles `na ≤ 2n`; `na > 2n` goes blockwise. The boundary is `2n`, not `2n+1`: with a `2n+1`-limb chunk the truncated `(chunk·R) >> 2n` quotient estimate underestimates `Q` by up to ~`B` steps (the error is `∝ chunk / B^(2n)`, which reaches `B` once the chunk exceeds `2n` limbs), overflowing the fixup cap and falling back to quadratic `FastDivision`. The `+1` limb routinely appears from the Knuth normalize shift on `a ≈ 2n`. Before this fix, `nb = 50000` spiked to ~5200 ms (vs ~360 ms at neighbouring sizes); routing `na > 2n` through the blockwise path keeps every chunk ≤ 2n and the fixup loop at 0–2 iterations.
 
-Without this flag, the fixup loop diverged at `n = 32768` in early testing, leading to a fallback path that hands off to `FastDivision` if iterations exhaust.
+**High-precision reciprocal flag.** When `na ≥ 2n`, the implementation also runs one **extra Newton iteration at full precision** after the main loop (`EXTRA_REFINE_ITERS = 1`), to drop integer-rounding error in `R` to ≤ 1 quotient digit. Without it the fixup loop diverged at `n = 32768` in early testing, handing off to `FastDivision`.
 
 **Why Newton, why bands.** The structural win of Newton over FastDivision is the O(M(n)) per-chunk cost versus FastDivision's O((m−n+1)·n) quadratic-in-quotient-size cost. The win materializes once the divisor is large enough that NTT/Karatsuba is faster than scalar multi-precision arithmetic, and once the ratio is skewed enough that reciprocal setup amortizes. The 2026-05 tuning lowered the dispatcher band from `b.size ≥ 8192` to `b.size ≥ 1024` with skew threshold `a ≥ 3b`, after a GMP-bench regression at `(a=200k, b=50k digits)` showed Newton was missing the band by 1–3 limbs at boundary cases.