Skip to content

Review fixes for the GPU LogUp aux build (#762)#779

Open
MauroToscano wants to merge 2 commits into
gpu_logup_auxfrom
gpu_logup_aux_review_fixes
Open

Review fixes for the GPU LogUp aux build (#762)#779
MauroToscano wants to merge 2 commits into
gpu_logup_auxfrom
gpu_logup_aux_review_fixes

Conversation

@MauroToscano

Copy link
Copy Markdown
Contributor

Fixes from a review of #762. No behavior change to the proving path beyond freeing device memory earlier; the proof output is untouched.

VRAM (the substantive one)

The trace-domain snapshot retained by the R1 main LDE (GpuLdeBase.trace_dev + TraceTable.main_trace_dev) is consumed exactly once — by the LogUp aux build — but both Arcs lived to the end of the proof, holding a main-trace-sized device buffer (~3 GB on ethrex dims) through the aux-commit + DEEP/FRI VRAM peak, including for tables that never take the GPU aux path. Both Arcs are now dropped right after the aux-build pass, so the reuse win is kept and the resident window shrinks to exactly where it's needed.

Hardening

  • cfg() in math-cuda/src/logup.rs now returns Err past the u32 grid limit instead of silently truncating the launch (leaving uninitialized output tails). Same pattern and rationale as batch_inverse_ext3_dev.
  • num_out_cols == 0 in logup_aux_resident is a runtime Err instead of a debug_assert: in release it wrapped num_out - 1 and launched the assemble kernel with a bogus column count.
  • Descriptor column indices are validated against the table width before any kernel launch. The kernels index main[col*num_rows + row] unchecked (they are never told num_cols), so a mis-authored table was a silent out-of-bounds device read where the CPU path panics cleanly.
  • logup_aux_resident asserts num_rows >= 1 (it reads row_sum[(num_rows-1)*3..]).

Tests

  • The term-column and resident-aux parity tests now canonicalize both sides before comparing (the canon3 pattern from the batch_inverse tests). The GPU pipeline computes the same field values through different op trees (tree-scan inverse, closed-form accumulate), so raw limbs can differ in the rare non-canonical band — a latent flake. The fingerprint test intentionally stays raw: that kernel mirrors the CPU evaluator op for op.
  • Clippy findings in the cuda-gated test module fixed (that module is never linted by CI, which runs clippy without the cuda feature).

CI

New test-stark-cuda-lib job runs the CPU-runnable descriptor/CPU-mirror parity tests (cargo test --release -p stark --features cuda --lib logup_gpu) on a plain runner — without nvcc the kernels build as empty PTX stubs and the descriptor tests never touch the driver. Previously nothing in CI even compiled logup_gpu, since no job enables the cuda feature. (The GPU-only #[ignore] tests still need a manual run on the GPU box; note make test-cuda-integration targets -p lambda-vm-prover only, so it does not run the stark-lib GPU tests either.)

Docs

  • Reattached three doc comments orphaned by items inserted beneath them (coset_lde_row_major_inner, u64_to_ext3_vec, logup_aux_resident) and updated the inner LDE return-tuple doc to the new 4-tuple.
  • try_expand_..._keep_dev borrows the resident buffer (D2D copy) rather than consuming it; GPU_LOGUP_CALLS counts successes, not attempts.
  • The disable_event_tracking comment now states the actual cross-stream invariant (producer host-syncs before its handle escapes) — this PR's trace_dev/ResidentAux handoffs made the old "we never share slices across streams" claim false.
  • Deduplicated split_interactions (reuse the lookup.rs definition) and fixed the build_accumulated_column_from_terms name in logup.cu.

Validation

  • cargo test -p stark --lib: 169 passed.
  • cargo test -p stark --features cuda --lib logup_gpu: descriptor parity tests pass on a GPU-less machine (GPU tests stay ignored).
  • cargo clippy clean on stark + math-cuda both with and without the cuda feature.
  • Not run here (no GPU): the #[ignore] GPU parity tests and the cuda integration test — worth one run on the GPU box before merging.

The trace-domain snapshot kept by the R1 main LDE (GpuLdeBase.trace_dev,
mirrored into TraceTable.main_trace_dev) is consumed exactly once, by the
LogUp aux build. Both Arcs previously lived to the end of the proof, holding
a main-trace-sized device buffer through the aux-commit + DEEP/FRI VRAM
peak — including for tables that never take the GPU aux path. Drop them
right after the aux-build pass instead.
…s, CI

Hardening:
- cfg() returns Err past the u32 grid limit instead of silently truncating
  the launch (same rationale and pattern as batch_inverse_ext3_dev).
- num_out_cols == 0 is a runtime Err, not a debug_assert: in release it would
  wrap num_out - 1 and launch the assemble kernel with a bogus column count.
- Validate descriptor column indices against the table width before launch;
  the kernels index main[col*num_rows + row] unchecked, so a mis-authored
  table was silent OOB device reads where the CPU path panics cleanly.
- logup_aux_resident asserts num_rows >= 1 (reads row_sum[(num_rows-1)*3..]).

Tests:
- Canonicalize both sides of the term-column and resident-aux parity asserts
  (canon3 pattern from the batch_inverse tests): the GPU pipeline computes the
  same field values through different op trees, so raw limbs can rarely differ
  in the non-canonical band. The fingerprint test stays raw on purpose — that
  kernel mirrors the CPU evaluator op for op.
- Fix clippy findings in the cuda-gated test module.

Docs:
- Reattach three doc comments orphaned by items inserted under them
  (coset_lde_row_major_inner, u64_to_ext3_vec, logup_aux_resident) and update
  the inner LDE return-tuple description to the new 4-tuple.
- try_expand_..._keep_dev borrows the resident buffer (D2D copy), it does not
  consume it; GPU_LOGUP_CALLS counts successes, not attempts.
- disable_event_tracking comment now states the real cross-stream invariant
  (producer host-syncs before a handle escapes) instead of claiming slices
  are never shared across streams.
- Deduplicate split_interactions (reuse the lookup.rs definition) and name
  build_accumulated_column_from_terms correctly in logup.cu.

CI:
- New job runs the CPU-runnable logup_gpu descriptor/CPU-mirror parity tests
  (cargo test -p stark --features cuda --lib logup_gpu); previously no CI job
  compiled the module at all since nothing enables the cuda feature.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant