Review fixes for the GPU LogUp aux build (#762)#779
Open
MauroToscano wants to merge 2 commits into
Open
Conversation
The trace-domain snapshot kept by the R1 main LDE (GpuLdeBase.trace_dev, mirrored into TraceTable.main_trace_dev) is consumed exactly once, by the LogUp aux build. Both Arcs previously lived to the end of the proof, holding a main-trace-sized device buffer through the aux-commit + DEEP/FRI VRAM peak — including for tables that never take the GPU aux path. Drop them right after the aux-build pass instead.
…s, CI Hardening: - cfg() returns Err past the u32 grid limit instead of silently truncating the launch (same rationale and pattern as batch_inverse_ext3_dev). - num_out_cols == 0 is a runtime Err, not a debug_assert: in release it would wrap num_out - 1 and launch the assemble kernel with a bogus column count. - Validate descriptor column indices against the table width before launch; the kernels index main[col*num_rows + row] unchecked, so a mis-authored table was silent OOB device reads where the CPU path panics cleanly. - logup_aux_resident asserts num_rows >= 1 (reads row_sum[(num_rows-1)*3..]). Tests: - Canonicalize both sides of the term-column and resident-aux parity asserts (canon3 pattern from the batch_inverse tests): the GPU pipeline computes the same field values through different op trees, so raw limbs can rarely differ in the non-canonical band. The fingerprint test stays raw on purpose — that kernel mirrors the CPU evaluator op for op. - Fix clippy findings in the cuda-gated test module. Docs: - Reattach three doc comments orphaned by items inserted under them (coset_lde_row_major_inner, u64_to_ext3_vec, logup_aux_resident) and update the inner LDE return-tuple description to the new 4-tuple. - try_expand_..._keep_dev borrows the resident buffer (D2D copy), it does not consume it; GPU_LOGUP_CALLS counts successes, not attempts. - disable_event_tracking comment now states the real cross-stream invariant (producer host-syncs before a handle escapes) instead of claiming slices are never shared across streams. - Deduplicate split_interactions (reuse the lookup.rs definition) and name build_accumulated_column_from_terms correctly in logup.cu. CI: - New job runs the CPU-runnable logup_gpu descriptor/CPU-mirror parity tests (cargo test -p stark --features cuda --lib logup_gpu); previously no CI job compiled the module at all since nothing enables the cuda feature.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Fixes from a review of #762. No behavior change to the proving path beyond freeing device memory earlier; the proof output is untouched.
VRAM (the substantive one)
The trace-domain snapshot retained by the R1 main LDE (
GpuLdeBase.trace_dev+TraceTable.main_trace_dev) is consumed exactly once — by the LogUp aux build — but both Arcs lived to the end of the proof, holding a main-trace-sized device buffer (~3 GB on ethrex dims) through the aux-commit + DEEP/FRI VRAM peak, including for tables that never take the GPU aux path. Both Arcs are now dropped right after the aux-build pass, so the reuse win is kept and the resident window shrinks to exactly where it's needed.Hardening
cfg()inmath-cuda/src/logup.rsnow returnsErrpast the u32 grid limit instead of silently truncating the launch (leaving uninitialized output tails). Same pattern and rationale asbatch_inverse_ext3_dev.num_out_cols == 0inlogup_aux_residentis a runtimeErrinstead of adebug_assert: in release it wrappednum_out - 1and launched the assemble kernel with a bogus column count.main[col*num_rows + row]unchecked (they are never toldnum_cols), so a mis-authored table was a silent out-of-bounds device read where the CPU path panics cleanly.logup_aux_residentassertsnum_rows >= 1(it readsrow_sum[(num_rows-1)*3..]).Tests
canon3pattern from thebatch_inversetests). The GPU pipeline computes the same field values through different op trees (tree-scan inverse, closed-form accumulate), so raw limbs can differ in the rare non-canonical band — a latent flake. The fingerprint test intentionally stays raw: that kernel mirrors the CPU evaluator op for op.CI
New
test-stark-cuda-libjob runs the CPU-runnable descriptor/CPU-mirror parity tests (cargo test --release -p stark --features cuda --lib logup_gpu) on a plain runner — without nvcc the kernels build as empty PTX stubs and the descriptor tests never touch the driver. Previously nothing in CI even compiledlogup_gpu, since no job enables the cuda feature. (The GPU-only#[ignore]tests still need a manual run on the GPU box; notemake test-cuda-integrationtargets-p lambda-vm-proveronly, so it does not run the stark-lib GPU tests either.)Docs
coset_lde_row_major_inner,u64_to_ext3_vec,logup_aux_resident) and updated the inner LDE return-tuple doc to the new 4-tuple.try_expand_..._keep_devborrows the resident buffer (D2D copy) rather than consuming it;GPU_LOGUP_CALLScounts successes, not attempts.disable_event_trackingcomment now states the actual cross-stream invariant (producer host-syncs before its handle escapes) — this PR'strace_dev/ResidentAuxhandoffs made the old "we never share slices across streams" claim false.split_interactions(reuse thelookup.rsdefinition) and fixed thebuild_accumulated_column_from_termsname inlogup.cu.Validation
cargo test -p stark --lib: 169 passed.cargo test -p stark --features cuda --lib logup_gpu: descriptor parity tests pass on a GPU-less machine (GPU tests stay ignored).cargo clippyclean on stark + math-cuda both with and without the cuda feature.#[ignore]GPU parity tests and the cuda integration test — worth one run on the GPU box before merging.