This document specifies how macOS-memory-benchmark implements standalone TLB analysis mode (-analyze-tlb) in version 0.55.4.
The goal is to provide a reproducible, implementation-accurate description of:
- measurement workflow,
- boundary detection logic,
- confidence scoring,
- derived metrics (entry reach and page-walk penalty),
- JSON verification payload.
This is an implementation whitepaper, not a generic architecture theory paper.
Primary implementation references:
- src/benchmark/tlb_analysis.cpp
- src/benchmark/tlb_analysis.h
- src/benchmark/tlb_boundary_detector.cpp
- src/benchmark/tlb_analysis_json.h
- src/core/config/argument_parser.cpp
- tests/test_analysis.cpp
Related user-facing docs:
- README.md
- MANUAL.md
Related whitepaper for the other standalone analysis mode:
- CORE_TO_CORE_WHITEPAPER.md: Core-to-Core Cache-Line Handoff Latency Benchmark (-analyze-core2core)
-analyze-tlb runs a dedicated analysis path and accepts only optional JSON output, optional latency stride override, optional chain-mode override, and optional sweep density:
    memory_benchmark -analyze-tlb
    memory_benchmark -analyze-tlb -output tlb_analysis.json
    memory_benchmark -output tlb_analysis.json -analyze-tlb
    memory_benchmark -analyze-tlb -latency-stride-bytes 128 -output tlb_analysis_stride128.json
    memory_benchmark -analyze-tlb -latency-chain-mode random-box -tlb-density medium -output tlb_analysis_medium.json

If -latency-stride-bytes is not provided, the default stride is 256 bytes, which matches the standard latency mode default (Constants::LATENCY_STRIDE_BYTES).
Sweep density applies only to -analyze-tlb.
- low: 15-point base sweep, no refinement pass
- medium: 15-point base sweep + refinement pass
- high (default): 29-point base sweep + refinement pass
All other options are rejected when -analyze-tlb is present.
Example (invalid):
    memory_benchmark -analyze-tlb -buffersize 1024

The analysis tries these buffers in order:
- 1024 MB
- 512 MB
- 256 MB
If all fail, mode exits with an insufficient-memory error.
mlock() is attempted for the selected buffer; report shows whether lock succeeded.
After allocation, the code validates that pointer_count = buffer_size / stride >= 2. If stride is very large relative to the selected buffer, mode exits with an error before any measurement begins.
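The fallback order and the pointer-count check can be sketched as follows. This is a minimal illustration under stated assumptions, not the project's code: `select_buffer_mb`, `has_enough_pointers`, and the `try_alloc` callback are hypothetical names, and real allocation and mlock() handling are left out.

```cpp
#include <array>
#include <cstddef>
#include <functional>
#include <optional>

// Hypothetical helper: returns the first candidate size (in MB) that the
// supplied allocation probe accepts, or std::nullopt if all three fail.
std::optional<std::size_t> select_buffer_mb(
    const std::function<bool(std::size_t)>& try_alloc) {
    constexpr std::array<std::size_t, 3> kCandidatesMb = {1024, 512, 256};
    for (std::size_t mb : kCandidatesMb) {
        if (try_alloc(mb)) return mb;
    }
    return std::nullopt;  // caller would report the insufficient-memory error
}

// Mirrors the documented check: pointer_count = buffer_size / stride >= 2.
bool has_enough_pointers(std::size_t buffer_bytes, std::size_t stride_bytes) {
    if (stride_bytes == 0) return false;
    return buffer_bytes / stride_bytes >= 2;
}
```

Passing the allocator as a callback keeps the ordering logic testable without touching real memory.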
The mode uses fixed constants:
- loops_per_point = 30
- accesses_per_loop = 25,000,000
Stride behavior:
- Effective stride is taken from -latency-stride-bytes (default: 256 bytes).
Chain mode: The latency chain mode is resolved via resolve_latency_chain_mode(config.latency_chain_mode, page_walk_baseline_locality_bytes). This determines the method used to build the pointer-chase chain for each locality measurement. The resolved mode is reported in the console output and stored in the JSON metadata.
Memory prefaulting: After buffer allocation, the code calls madvise(ptr, size_bytes, MADV_WILLNEED) to prefault pages and reduce page-fault noise during early measurement.
Measurement loop: Each loop rebuilds the latency chain for the target locality, performs warmup, runs pointer chase, and stores latency as ns/access.
Per-point central value is P50 (median) over 30 loops.
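A per-point P50 could be computed along these lines. The function name `p50_ns` is hypothetical, and averaging the two middle samples for an even sample count (such as 30 loops) is an assumption; the source does not state which median convention the implementation uses.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// P50 of one sweep point's loop latencies (ns/access).
// Assumption: for an even sample count the two middle values are averaged.
double p50_ns(std::vector<double> samples) {
    assert(!samples.empty());
    std::sort(samples.begin(), samples.end());
    const std::size_t n = samples.size();
    const std::size_t mid = n / 2;
    return (n % 2 == 1) ? samples[mid]
                        : 0.5 * (samples[mid - 1] + samples[mid]);
}
```

Using the median rather than the mean keeps a single outlier loop from moving the per-point value.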
Main sweep windows depend on -tlb-density:
- low/medium base set (15 points): 16KB, 64KB, 128KB, 256KB, 512KB, 1MB, 2MB, 4MB, 8MB, 12MB, 16MB, 32MB, 64MB, 128MB, 256MB
- high base set (29 points): 16KB, 32KB, 64KB, 96KB, 128KB, 192KB, 256KB, 384KB, 512KB, 768KB, 1MB, 1536KB, 2MB, 3MB, 4MB, 6MB, 8MB, 10MB, 12MB, 14MB, 16MB, 24MB, 32MB, 48MB, 64MB, 96MB, 128MB, 192MB, 256MB
Refinement policy by density:
- low: refinement disabled
- medium/high: refinement enabled around detected transitions
Effective sweep start is stride-aware:
- min_sweep_locality = max(16KB, 2 * stride)
- points below min_sweep_locality are skipped
- if min_sweep_locality is not in the canonical set, it is inserted as the first sweep point
The final sweep vector is deduplicated (via std::sort followed by std::unique). This prevents duplicates if min_sweep_locality exactly matches one of the canonical points.
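The stride-aware start and the deduplication step described above can be sketched as follows. `build_sweep` is a hypothetical name and the structure is an assumption; only the max(16KB, 2 * stride) rule and the std::sort plus std::unique step come from the text.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Hypothetical helper: applies the stride-aware start to a canonical sweep
// (all localities in bytes).
std::vector<std::size_t> build_sweep(
    const std::vector<std::size_t>& canonical_bytes, std::size_t stride_bytes) {
    constexpr std::size_t kMinLocality = 16 * 1024;
    const std::size_t min_sweep_locality =
        std::max(kMinLocality, 2 * stride_bytes);

    std::vector<std::size_t> sweep;
    sweep.push_back(min_sweep_locality);  // guarantees the start point exists
    for (std::size_t locality : canonical_bytes) {
        if (locality >= min_sweep_locality) sweep.push_back(locality);
    }
    // Removes the duplicate created when the start matches a canonical point.
    std::sort(sweep.begin(), sweep.end());
    sweep.erase(std::unique(sweep.begin(), sweep.end()), sweep.end());
    return sweep;
}
```

Always pushing the start point and deduplicating afterwards avoids a separate "is it already in the set" branch.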
Separate page-walk comparison point:
- 512MB (run only when selected buffer is at least 512MB)
Boundary detection uses a recency-weighted segment baseline with multi-point persistence, adaptive noise thresholding, and IQR-overlap rejection.
For candidate index i:
- baseline_ns = recency_weighted_average(p50[segment_start, i)): a weighted average in which point j receives weight (j - segment_start + 1). Recent measurements carry more influence, reducing drag from early points that may have had different thermal or frequency conditions.
- step_ns = p50[i] - baseline_ns
- noise_boost_ns = median(IQR of baseline loop-latency rows) (when loop rows are available and baseline has at least 3 points)
- threshold_ns = max(2.0ns, baseline_ns * 0.10, noise_boost_ns)
Candidate passes threshold when:
step_ns >= threshold_ns
The algorithm scans from segment_start_index + 1 and returns the first candidate that satisfies threshold, guard, and IQR conditions. If no candidate passes, detection returns detected = false.
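The baseline, step, and threshold formulas above can be written out as a short sketch. The function names are hypothetical and the code only restates the documented formulas; it is not taken from tlb_boundary_detector.cpp.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Recency-weighted mean of p50[segment_start, i): point j gets weight
// (j - segment_start + 1). Requires segment_start < i <= p50.size().
double recency_weighted_baseline(const std::vector<double>& p50,
                                 std::size_t segment_start, std::size_t i) {
    double weighted_sum = 0.0;
    double weight_total = 0.0;
    for (std::size_t j = segment_start; j < i; ++j) {
        const double w = static_cast<double>(j - segment_start + 1);
        weighted_sum += w * p50[j];
        weight_total += w;
    }
    return weighted_sum / weight_total;
}

// threshold_ns = max(2.0ns, 10% of baseline, noise_boost_ns).
// Pass noise_boost_ns = 0.0 when raw loop rows are unavailable.
double threshold_ns(double baseline_ns, double noise_boost_ns) {
    return std::max({2.0, baseline_ns * 0.10, noise_boost_ns});
}

bool passes_threshold(double candidate_p50_ns, double baseline_ns,
                      double noise_boost_ns) {
    const double step_ns = candidate_p50_ns - baseline_ns;
    return step_ns >= threshold_ns(baseline_ns, noise_boost_ns);
}
```

For p50 values {10, 10, 13} the weights are 1, 2, 3, so the baseline is (10 + 20 + 39) / 6 = 11.5 ns, noticeably closer to the latest point than the plain mean of 11.0 ns.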
To avoid classifying early cache transitions as TLB boundaries, candidate locality must also satisfy:
locality_bytes[i] >= tlb_guard_bytes
where:
tlb_guard_bytes = max(2 * L1D_size_bytes, 64 * page_size_bytes)
The guard acts as a hard lower-bound filter on the candidate index i; it does not shift the segment start or baseline calculation.
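As a small worked sketch of the guard formula (function names are hypothetical; the inputs in the note below are example values, not measurements):

```cpp
#include <algorithm>
#include <cstddef>

// tlb_guard_bytes = max(2 * L1D_size_bytes, 64 * page_size_bytes)
constexpr std::size_t tlb_guard_bytes(std::size_t l1d_size_bytes,
                                      std::size_t page_size_bytes) {
    return std::max(2 * l1d_size_bytes, 64 * page_size_bytes);
}

// Hard lower-bound filter applied to the candidate locality only;
// it does not move the segment start or change the baseline.
constexpr bool passes_guard(std::size_t locality_bytes,
                            std::size_t guard_bytes) {
    return locality_bytes >= guard_bytes;
}
```

With a 16 KB page and an assumed 128 KB L1D, the two terms are 256 KB and 1 MB, so the page-based term dominates and the guard is 1 MB (1048576 bytes).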
When per-point raw loop latencies are available (the 30 individual loop measurements), the detector first estimates a baseline-noise floor from IQR values and raises the candidate threshold when needed:
- IQR(point_j) = Q3_j - Q1_j
- noise_boost_ns = median(IQR(point_j)) for j in [segment_start, i)
- effective threshold includes this term: max(2.0ns, 10% baseline, noise_boost_ns)
Then it applies an IQR-overlap check to reject candidates whose step still falls inside baseline noise overlap:
- avg_baseline_q3 = average(Q3 of raw loops for each point in [segment_start, i))
- candidate_q1 = Q1 of raw loops at point i
- If avg_baseline_q3 >= candidate_q1, the baseline's upper noise band overlaps the candidate's lower band → candidate is rejected even if the P50 step exceeds threshold.
This prevents false positives from "lucky medians" where a noisy point happens to have a high P50 due to sampling variance.
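The noise floor and the overlap check can be sketched as below. All names are hypothetical, and the quartile method (linear interpolation between order statistics) is an assumption, since the text does not say which quantile definition the detector uses. Inputs are assumed non-empty.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <vector>

struct Quartiles { double q1; double q3; };

// Assumption: quantiles use linear interpolation between order statistics.
double quantile(std::vector<double> v, double p) {
    std::sort(v.begin(), v.end());
    const double pos = p * static_cast<double>(v.size() - 1);
    const std::size_t lo = static_cast<std::size_t>(std::floor(pos));
    const std::size_t hi = std::min(lo + 1, v.size() - 1);
    return v[lo] + (pos - static_cast<double>(lo)) * (v[hi] - v[lo]);
}

Quartiles quartiles(const std::vector<double>& loops) {
    return {quantile(loops, 0.25), quantile(loops, 0.75)};
}

// median(IQR) over the raw loop rows of the baseline points.
double noise_boost_ns(const std::vector<std::vector<double>>& baseline_rows) {
    std::vector<double> iqrs;
    for (const auto& row : baseline_rows) {
        const Quartiles q = quartiles(row);
        iqrs.push_back(q.q3 - q.q1);
    }
    return quantile(iqrs, 0.5);
}

// True when the averaged baseline Q3 reaches the candidate's Q1 (reject).
bool iqr_overlap_rejects(const std::vector<std::vector<double>>& baseline_rows,
                         const std::vector<double>& candidate_loops) {
    double q3_sum = 0.0;
    for (const auto& row : baseline_rows) q3_sum += quartiles(row).q3;
    const double avg_baseline_q3 =
        q3_sum / static_cast<double>(baseline_rows.size());
    return avg_baseline_q3 >= quartiles(candidate_loops).q1;
}
```

A different quantile convention would shift Q1 and Q3 slightly, so values recomputed this way may differ marginally from the tool's own.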
Instead of checking a single future point, the detector checks up to 3 future points:
- persistent_count = count of j in [i+1, min(i+4, size)) where p50[j] - baseline >= threshold
- persistent_jump = persistent_count >= 2 (majority of up to 3)
This makes detection robust against single-point noise dips after a genuine boundary. A boundary at index i followed by one noisy dip and two confirming points will still be classified as persistent.
When the candidate is at or near the last sweep point, there are few or no future points for persistence evaluation. In this case, if the step itself is very large:
strong_last_point = (step_ns >= 8.0) || (step_percent >= 0.25)
Then effective_persistent = persistent_jump || strong_last_point. This prevents downgrading a massive final-point step to Low confidence purely due to lack of future data.
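A minimal sketch of the persistence logic follows. Function names are hypothetical. Two points are assumptions: "at or near the last sweep point" is modelled as fewer than 2 future points being available, and step_percent is treated as a fraction of the baseline (0.25 means 25%).

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Counts how many of the next (up to 3) points stay at or above
// baseline + threshold; two or more make the jump persistent.
bool persistent_jump(const std::vector<double>& p50, std::size_t i,
                     double baseline_ns, double threshold_ns) {
    const std::size_t end = std::min(i + 4, p50.size());
    std::size_t persistent_count = 0;
    for (std::size_t j = i + 1; j < end; ++j) {
        if (p50[j] - baseline_ns >= threshold_ns) ++persistent_count;
    }
    return persistent_count >= 2;
}

// Assumption: the fallback applies only when fewer than 2 future points exist.
bool strong_last_point(std::size_t i, std::size_t sweep_size,
                       double step_ns, double step_percent) {
    const std::size_t future_points = sweep_size - i - 1;
    if (future_points >= 2) return false;
    return step_ns >= 8.0 || step_percent >= 0.25;
}

bool effective_persistent(const std::vector<double>& p50, std::size_t i,
                          double baseline_ns, double threshold_ns,
                          double step_ns, double step_percent) {
    return persistent_jump(p50, i, baseline_ns, threshold_ns) ||
           strong_last_point(i, p50.size(), step_ns, step_percent);
}
```

With p50 = {10, 10, 15, 11, 15, 15}, a candidate at index 2 and a 2 ns threshold over a 10 ns baseline, the follow-up steps are 1, 5, and 5 ns, so two of three confirm and the jump is persistent despite the dip.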
- L1 TLB boundary: first candidate passing threshold + guard + IQR check, scanning from sweep start.
- L2 TLB boundary: first candidate passing threshold + guard + IQR check, scanning from an offset segment start past the L1 boundary.
Private-cache overlap handling:
- The analyzer first checks the direct L1 candidate from sweep start.
- If that candidate is the same locality as a detected private-cache knee, it is preserved as an ambiguous L1 TLB candidate and marked with overlaps_private_cache_knee = true.
- If the direct candidate does not overlap the private-cache knee, the analyzer also tries the cache-knee-offset scan used to avoid obvious cache-knee contamination.
L2 detection specifics:
- Segment start offset: L2 scanning starts at min(L1_boundary_index + 2, size - 2), excluding the L1 boundary point and its immediate neighbour from the L2 baseline. This prevents L1 transition noise from contaminating the L2 baseline.
- L2-specific guard: max(tlb_guard_bytes, L1_boundary_locality_bytes), preventing L2 from re-detecting at or below the L1 boundary locality.
L2 detection only runs when L1 is detected and its boundary index is not at the last two sweep points.
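The L2 preconditions, segment start, and guard can be combined in one sketch. `L2ScanPlan` and `plan_l2_scan` are hypothetical names; "not at the last two sweep points" is interpreted here as L1_boundary_index + 2 < size, which is an assumption about the exact comparison.

```cpp
#include <algorithm>
#include <cstddef>
#include <optional>

struct L2ScanPlan {
    std::size_t segment_start;  // first index of the L2 baseline
    std::size_t guard_bytes;    // minimum locality for an L2 candidate
};

// Returns std::nullopt when L2 detection should not run.
std::optional<L2ScanPlan> plan_l2_scan(bool l1_detected, std::size_t l1_index,
                                       std::size_t l1_locality_bytes,
                                       std::size_t tlb_guard_bytes,
                                       std::size_t sweep_size) {
    // L1 must be detected and must not sit in the last two sweep points.
    if (!l1_detected || l1_index + 2 >= sweep_size) {
        return std::nullopt;
    }
    return L2ScanPlan{std::min(l1_index + 2, sweep_size - 2),
                      std::max(tlb_guard_bytes, l1_locality_bytes)};
}
```

For a 29-point sweep with L1 detected at index 14 (4 MB) and a 1 MB guard, the L2 baseline starts at index 16 and the L2 guard becomes 4 MB.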
Technical note (Apple Silicon):
- On Apple Silicon, the shared System Level Cache (SLC) can blur translation-only inflection points.
- In practice, the detected L2 TLB boundary should be interpreted as an inferred secondary translation-reach boundary, not a guaranteed pure architectural L2 TLB edge.
Boundary confidence is classified by step strength and multi-point persistence:
- strong_step = (step_ns >= 4.0) || (step_percent >= 0.15)
- persistent_jump: majority of up to 3 future points exceed threshold (see the multi-point persistence check above). At the last sweep point, a strong step (step_ns >= 8.0 or step_percent >= 0.25) compensates for the lack of future data.
Confidence levels:
- High: strong_step && persistent_jump
- Medium: strong_step || persistent_jump
- Low: otherwise (still passed threshold)
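The three levels map directly onto a small classifier. The enum and function names are hypothetical; the thresholds are the documented ones, with step_percent assumed to be a fraction (0.15 means 15%).

```cpp
enum class Confidence { Low, Medium, High };

// persistent_jump is expected to already include the last-point fallback.
Confidence classify_confidence(double step_ns, double step_percent,
                               bool persistent_jump) {
    const bool strong_step = (step_ns >= 4.0) || (step_percent >= 0.15);
    if (strong_step && persistent_jump) return Confidence::High;
    if (strong_step || persistent_jump) return Confidence::Medium;
    return Confidence::Low;  // passed threshold, but weak and not persistent
}
```

A 2.5 ns step on a 25 ns baseline (10%) that persists is Medium: it is confirmed by later points but is not strong on either the absolute or the relative criterion.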
For detected boundaries:
inferred_entries_min = previous_locality_bytes / page_size_bytes
inferred_entries_max = boundary_locality_bytes / page_size_bytes
inferred_entries = midpoint(inferred_entries_min, inferred_entries_max)
Point estimate and range are reported separately for L1 and L2 sections. The point estimate is intentionally the midpoint of the detected locality window, not the upper boundary edge, so the headline value does not imply more precision than the sweep grid supports.
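As a worked sketch of the entry-reach arithmetic (struct and function names are hypothetical, and rounding the midpoint down for odd spans is an assumption):

```cpp
#include <cstddef>

struct InferredEntries {
    std::size_t entries_min;  // previous sweep point, in pages
    std::size_t entries_max;  // boundary sweep point, in pages
    std::size_t entries;      // midpoint estimate
};

InferredEntries infer_entries(std::size_t previous_locality_bytes,
                              std::size_t boundary_locality_bytes,
                              std::size_t page_size_bytes) {
    const std::size_t lo = previous_locality_bytes / page_size_bytes;
    const std::size_t hi = boundary_locality_bytes / page_size_bytes;
    return {lo, hi, lo + (hi - lo) / 2};
}
```

With 16 KB pages, a boundary at 4 MB preceded by a 3 MB sweep point gives a range of 192 to 256 entries and a midpoint of 224.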
Page-walk penalty is intentionally computed independently from L1/L2 boundary step values:
page_walk_penalty_ns = P50(512MB) - P50(effective baseline locality)
where:
- P50(effective baseline locality) is the P50 latency of the first point in the stride-aware sweep (i.e., p50_latency_ns[0]); it is measured during the main run, not by a separate reference measurement.
The penalty is computed as-is; a negative value indicates measurement noise (e.g., due to thermal variance or CPU frequency scaling) rather than a genuine negative penalty.
Availability rule:
- available only when selected analysis buffer is >= 512MB
- otherwise reported as unavailable (N/A) with reason.
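The penalty and its availability rule can be sketched with std::optional standing in for the N/A case. The function name and parameter names are hypothetical.

```cpp
#include <cstddef>
#include <optional>

// Returns std::nullopt (reported as N/A) when the selected buffer cannot
// hold the 512MB comparison point.
std::optional<double> page_walk_penalty_ns(std::size_t selected_buffer_bytes,
                                           double p50_512mb_ns,
                                           double p50_first_sweep_point_ns) {
    constexpr std::size_t kComparisonBytes = 512ull * 1024 * 1024;
    if (selected_buffer_bytes < kComparisonBytes) return std::nullopt;
    // Computed as-is: a negative result indicates noise, not a real gain.
    return p50_512mb_ns - p50_first_sweep_point_ns;
}
```

Returning the raw difference without clamping keeps a negative result visible to whoever verifies the run.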
Report always includes:
- CPU, page size, selected buffer, stride, loops/accesses config
- Resolved latency chain mode
- Fine-sweep refinement summary (added points, total points)
- Report sections: [Private Cache Knee Detection], [L1 TLB Detection], [L2 TLB / Page Walk]
Boundary detection sections:
When a boundary is detected, the report shows:
- boundary locality,
- inferred entries,
- confidence string with step (ns and %).
When a boundary is not detected, the section reports "Not detected."
Page-walk section:
- When available: shows page-walk penalty in ns, with explicit dynamic endpoints (baseline -> 512MB)
- When unavailable: shows "N/A" with the reason (e.g., buffer < 512 MB)
When -output <file> is provided with -analyze-tlb, output includes:
- top-level metadata:
  - configuration
  - execution_time_sec
  - tlb_analysis
  - timestamp
  - version
- configuration contains mode and run constants:
  - CPU info, page size, L1D size, TLB guard bytes, stride
  - latency_sample_count (fixed at 30 loops per point)
  - accesses_per_sample (fixed at 25,000,000)
  - latency_chain_mode (resolved chain mode used)
  - performance_cores and efficiency_cores (CPU core configuration)
  - selected buffer size and whether mlock succeeded
- tlb_analysis contains:
  - sweep[] with raw loop_latencies_ns and per-point p50_latency_ns
  - private_cache_knee (with detected, boundary_locality_kb, confidence, and may_interfere_with_tlb)
  - l1_tlb_detection (with detected, boundary_locality_kb, inferred_entries, inferred_entries_method, inferred_entries_min, inferred_entries_max, overlaps_private_cache_knee, confidence, and step metadata)
  - l2_tlb_detection (same structure as L1)
  - page_walk_penalty block (available, baseline/comparison metadata, raw comparison loops, penalty_ns when available)
This payload is designed for full post-run verification and reproducibility checks.
Example file:
results/0.53.7/MacMiniM4_analyzetlb.json
Observed fields in this sample:
- tlb_guard_bytes = 1048576 (1MB)
- l1_tlb_detection.boundary_locality_kb = 4096 and inferred_entries near the midpoint of its detected entry range
- l2_tlb_detection.boundary_locality_kb = 8192 and inferred_entries near the midpoint of its detected entry range
- page_walk_penalty.penalty_ns ~= 81.93
The sweep includes points: [16KB, 32KB, 64KB, 96KB, 128KB, 192KB, 256KB, 384KB, 512KB, 768KB, 1MB, 1536KB, 2MB, 3MB, 4MB, 6MB, 8MB, ...].
The L1 scan begins at 16KB (min_sweep_locality with stride 64 bytes). It computes:
- At 16KB: baseline = P50(16KB) = X ns, step = 0. This point only seeds the baseline; candidates are scanned from segment_start_index + 1.
- At 32KB: baseline = P50(16KB), step = P50(32KB) - baseline, still low (and below the 1MB guard).
- ... (continuing through 2MB, 3MB)
- At 4MB: baseline = recency_weighted_average(P50[16KB..3MB]), step = P50(4MB) - baseline ≥ threshold, locality(4MB) = 4096KB ≥ guard, first match → detected at 4MB.
With a high-density sweep, a 4MB boundary following a 3MB prior point yields inferred_entries_min = 192, inferred_entries_max = 256, and inferred_entries = 224 as a midpoint estimate. The upper edge (256) is still visible in the range and is typical for a first-level TLB on Apple M4.
- If the entire sweep remained below threshold (e.g., very small buffer or very large stride), L1 would report "Not detected."
- If the 256 MB buffer fallback is used, page_walk_penalty will report N/A and explain why.
- On machines with 4 KB pages (instead of the 16 KB pages used on Apple Silicon), the same 256-entry upper edge would correspond to a 1 MB boundary.
Interpretation note for Apple Silicon:
- The reported l2_tlb_detection value is an inferred boundary under this methodology and may reflect combined translation pressure and SLC/memory-hierarchy effects.
- This is a user-space benchmark; it does not directly read hardware PMU counters.
- Latency includes full dependent-load path effects (cache hierarchy, translation pressure, runtime noise).
- Guarding with max(2*L1D, 64*page) reduces but does not mathematically eliminate all non-TLB artifacts.
- On Apple Silicon specifically, SLC behavior can make strict architectural L2 TLB isolation difficult; treat l2_tlb_detection as an inferred secondary boundary.
All Apple Silicon Macs use a fixed 16 KB page size. When comparing -analyze-tlb results across different Apple Silicon generations (M1, M2, M3, M4, M5, etc.):
- inferred_entries is calculated as the midpoint of the detected entry range; use inferred_entries_min and inferred_entries_max when comparing uncertainty windows.
- The actual entry counts may differ between generations due to microarchitectural changes.
- Comparing inferred_entries directly across models is valid (same page size), but be aware that TLB capacity has evolved across generations.
If L1 detection reports "Not detected":
- Check the selected buffer size: Is it large enough to sweep through the TLB capacity? The 16 KB sweep start may be insufficient if stride is large.
- Review the stride: If -latency-stride-bytes is very large, min_sweep_locality = max(16KB, 2 * stride) moves the sweep start above 16KB and may skip low sweep points. This requires a stride above 8KB; a 512-byte stride leaves the start at 16KB. Try re-running with a smaller stride.
- Check the raw JSON data: Export with -output and inspect sweep[].p50_latency_ns to see if the expected inflection point is present in the raw data.
- Verify CPU/system state: Thermal throttling or power-saving modes can obscure boundaries; run in a stable, idle state.
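- Keep the command line fixed (stride, chain mode, sweep density, output file) and confirm that the same buffer size was selected in each run, since -buffersize cannot be set in this mode.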
- Keep thermal/load background stable.
- Use repeated runs (3–5) to verify consistency.
- Verify with exported JSON raw loops and P50 values.
- Inspect execution_time_sec and latency_chain_mode to ensure measurement conditions were consistent.