Review fixes for the GPU LogUp aux build (#762) by MauroToscano · Pull Request #779 · yetanotherco/lambda_vm

MauroToscano · 2026-07-03T21:23:54Z

Fixes from a review of #762. No behavior change to the proving path beyond freeing device memory earlier; the proof output is untouched.

VRAM (the substantive one)

The trace-domain snapshot retained by the R1 main LDE (GpuLdeBase.trace_dev + TraceTable.main_trace_dev) is consumed exactly once — by the LogUp aux build — but both Arcs lived to the end of the proof, holding a main-trace-sized device buffer (~3 GB on ethrex dims) through the aux-commit + DEEP/FRI VRAM peak, including for tables that never take the GPU aux path. Both Arcs are now dropped right after the aux-build pass, so the reuse win is kept and the resident window shrinks to exactly where it's needed.

Hardening

cfg() in math-cuda/src/logup.rs now returns Err past the u32 grid limit instead of silently truncating the launch (leaving uninitialized output tails). Same pattern and rationale as batch_inverse_ext3_dev.
num_out_cols == 0 in logup_aux_resident is a runtime Err instead of a debug_assert: in release it wrapped num_out - 1 and launched the assemble kernel with a bogus column count.
Descriptor column indices are validated against the table width before any kernel launch. The kernels index main[col*num_rows + row] unchecked (they are never told num_cols), so a mis-authored table was a silent out-of-bounds device read where the CPU path panics cleanly.
logup_aux_resident asserts num_rows >= 1 (it reads row_sum[(num_rows-1)*3..]).

Tests

The term-column and resident-aux parity tests now canonicalize both sides before comparing (the canon3 pattern from the batch_inverse tests). The GPU pipeline computes the same field values through different op trees (tree-scan inverse, closed-form accumulate), so raw limbs can differ in the rare non-canonical band — a latent flake. The fingerprint test intentionally stays raw: that kernel mirrors the CPU evaluator op for op.
Clippy findings in the cuda-gated test module fixed (that module is never linted by CI, which runs clippy without the cuda feature).

CI

New test-stark-cuda-lib job runs the CPU-runnable descriptor/CPU-mirror parity tests (cargo test --release -p stark --features cuda --lib logup_gpu) on a plain runner — without nvcc the kernels build as empty PTX stubs and the descriptor tests never touch the driver. Previously nothing in CI even compiled logup_gpu, since no job enables the cuda feature. (The GPU-only #[ignore] tests still need a manual run on the GPU box; note make test-cuda-integration targets -p lambda-vm-prover only, so it does not run the stark-lib GPU tests either.)

Docs

Reattached three doc comments orphaned by items inserted beneath them (coset_lde_row_major_inner, u64_to_ext3_vec, logup_aux_resident) and updated the inner LDE return-tuple doc to the new 4-tuple.
try_expand_..._keep_dev borrows the resident buffer (D2D copy) rather than consuming it; GPU_LOGUP_CALLS counts successes, not attempts.
The disable_event_tracking comment now states the actual cross-stream invariant (producer host-syncs before its handle escapes) — this PR's trace_dev/ResidentAux handoffs made the old "we never share slices across streams" claim false.
Deduplicated split_interactions (reuse the lookup.rs definition) and fixed the build_accumulated_column_from_terms name in logup.cu.

Validation

cargo test -p stark --lib: 169 passed.
cargo test -p stark --features cuda --lib logup_gpu: descriptor parity tests pass on a GPU-less machine (GPU tests stay ignored).
cargo clippy clean on stark + math-cuda both with and without the cuda feature.
Not run here (no GPU): the #[ignore] GPU parity tests and the cuda integration test — worth one run on the GPU box before merging.

The trace-domain snapshot kept by the R1 main LDE (GpuLdeBase.trace_dev, mirrored into TraceTable.main_trace_dev) is consumed exactly once, by the LogUp aux build. Both Arcs previously lived to the end of the proof, holding a main-trace-sized device buffer through the aux-commit + DEEP/FRI VRAM peak — including for tables that never take the GPU aux path. Drop them right after the aux-build pass instead.

…s, CI Hardening: - cfg() returns Err past the u32 grid limit instead of silently truncating the launch (same rationale and pattern as batch_inverse_ext3_dev). - num_out_cols == 0 is a runtime Err, not a debug_assert: in release it would wrap num_out - 1 and launch the assemble kernel with a bogus column count. - Validate descriptor column indices against the table width before launch; the kernels index main[col*num_rows + row] unchecked, so a mis-authored table was silent OOB device reads where the CPU path panics cleanly. - logup_aux_resident asserts num_rows >= 1 (reads row_sum[(num_rows-1)*3..]). Tests: - Canonicalize both sides of the term-column and resident-aux parity asserts (canon3 pattern from the batch_inverse tests): the GPU pipeline computes the same field values through different op trees, so raw limbs can rarely differ in the non-canonical band. The fingerprint test stays raw on purpose — that kernel mirrors the CPU evaluator op for op. - Fix clippy findings in the cuda-gated test module. Docs: - Reattach three doc comments orphaned by items inserted under them (coset_lde_row_major_inner, u64_to_ext3_vec, logup_aux_resident) and update the inner LDE return-tuple description to the new 4-tuple. - try_expand_..._keep_dev borrows the resident buffer (D2D copy), it does not consume it; GPU_LOGUP_CALLS counts successes, not attempts. - disable_event_tracking comment now states the real cross-stream invariant (producer host-syncs before a handle escapes) instead of claiming slices are never shared across streams. - Deduplicate split_interactions (reuse the lookup.rs definition) and name build_accumulated_column_from_terms correctly in logup.cu. CI: - New job runs the CPU-runnable logup_gpu descriptor/CPU-mirror parity tests (cargo test -p stark --features cuda --lib logup_gpu); previously no CI job compiled the module at all since nothing enables the cuda feature.

MauroToscano added 2 commits July 3, 2026 18:22

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Review fixes for the GPU LogUp aux build (#762)#779

Review fixes for the GPU LogUp aux build (#762)#779
MauroToscano wants to merge 2 commits into
gpu_logup_auxfrom
gpu_logup_aux_review_fixes

MauroToscano commented Jul 3, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Uh oh!

Conversation

MauroToscano commented Jul 3, 2026

VRAM (the substantive one)

Hardening

Tests

CI

Docs

Validation

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant