
fix(cudnn): implement cuDNN 9 error codes, replace todo!() panic with proper mapping #371

Open
CharryWu wants to merge 2 commits into Rust-GPU:main from CharryWu:fix/cudnn9-error-codes

@CharryWu (Contributor) commented Mar 31, 2026

Summary

Fixes a runtime panic in crates/cudnn/src/error.rs that occurs when the crate is compiled against cuDNN 9+. The wildcard arm _ => todo!() in IntoResult::into_result() would panic whenever cuDNN returned any of the new hierarchical sub-codes introduced in cuDNN 9, turning an ordinary error status into a crash. This PR replaces it with a proper category-based fallback mapping.


Problem

Background: how cuDNN version detection works

At compile time, the build script reads the cuDNN version from Cargo's links metadata (the DEP_CUDNN_VERSION environment variable) and, if the version is 90000 or higher, emits a cfg flag:

┌─────────────────────────────────────────────────────────────────────────────┐
│  crates/cudnn/build.rs                                                      │
│                                                                             │
│  DEP_CUDNN_VERSION ──► parse as u32 ──► >= 90000?                          │
│                                                 │                           │
│                                        ┌────────┴────────┐                 │
│                                       YES               NO                 │
│                                        │                 │                 │
│                                        ▼                 ▼                 │
│                             cargo::rustc-cfg=cudnn9   (nothing)            │
└─────────────────────────────────────────────────────────────────────────────┘
                    │
                    ▼
        Source files compiled with #[cfg(cudnn9)]
        or #[cfg(not(cudnn9))] blocks selected accordingly
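For readers unfamiliar with this mechanism, the detection step in the diagram can be sketched roughly as follows. This is an illustrative build-script fragment, not the crate's actual build.rs: the helper names cudnn9_enabled and emit_cudnn_cfg are hypothetical, and it assumes Cargo 1.80+ for the cargo:: directive syntax and check-cfg support.

```rust
use std::env;

/// Hypothetical helper: decide whether to enable `cfg(cudnn9)` from the raw
/// value of DEP_CUDNN_VERSION (a numeric string such as "90200").
/// A missing or unparsable value is treated as "not cuDNN 9".
fn cudnn9_enabled(dep_cudnn_version: Option<&str>) -> bool {
    dep_cudnn_version
        .and_then(|v| v.trim().parse::<u32>().ok())
        .map_or(false, |version| version >= 90000)
}

/// What a build script's `main` would call: read the metadata, emit the cfg.
fn emit_cudnn_cfg() {
    // Declare `cudnn9` as an expected cfg name so rustc does not warn about it.
    println!("cargo::rustc-check-cfg=cfg(cudnn9)");

    let raw = env::var("DEP_CUDNN_VERSION").ok();
    if cudnn9_enabled(raw.as_deref()) {
        println!("cargo::rustc-cfg=cudnn9");
    }
}
```

Keeping the version decision in a pure function makes the threshold logic testable without running Cargo; only emit_cudnn_cfg touches the environment and stdout.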

That part works fine. The bug lives entirely in the runtime code that the cfg selection activates.

The broken runtime path (before this PR)

cuDNN 9 restructured cudnnStatus_t into a hierarchical numeric system. Each broad error category was split into specific sub-codes that share the same thousands-digit as their parent:

  Category numeric prefix   Examples (cuDNN 9 new sub-codes)
  ═══════════════════════╦═══════════════════════════════════════════════════
  2xxx  BAD_PARAM        ║  2002 CUDNN_STATUS_BAD_PARAM_NULL_POINTER
  3xxx  NOT_SUPPORTED    ║  3007 CUDNN_STATUS_NOT_SUPPORTED_ARCH_MISMATCH
  4xxx  INTERNAL_ERROR   ║  4001 CUDNN_STATUS_INTERNAL_ERROR_ASSERTION_FAILED
  5xxx  EXECUTION_FAILED ║  5001 CUDNN_STATUS_EXECUTION_FAILED_INSUFFICIENT_MEM

The old into_result() only matched the parent codes exactly. Any unrecognised sub-code fell into the wildcard arm:

┌─────────────────────────────────────────────────────────────────────────────┐
│  IntoResult::into_result()  — BEFORE (cfg(cudnn9) build)                   │
│                                                                             │
│  cudnnStatus_t raw value                                                    │
│       │                                                                     │
│       ▼                                                                     │
│  CUDNN_STATUS_SUCCESS ──────────────────────────────────► Ok(())            │
│       │                                                                     │
│  known exact variants (NOT_INITIALIZED, BAD_PARAM …) ──► Err(CudnnError…)  │
│       │                                                                     │
│  _ => todo!()  ◄──── cuDNN 9 sub-codes land HERE                           │
│       │                                                                     │
│       └──────────────────────────────────────────────► 💥 PROCESS PANIC    │
└─────────────────────────────────────────────────────────────────────────────┘

Fix

Runtime path after this PR

┌─────────────────────────────────────────────────────────────────────────────┐
│  IntoResult::into_result()  — AFTER (cfg(cudnn9) build)                    │
│                                                                             │
│  cudnnStatus_t raw value                                                    │
│       │                                                                     │
│       ├─► CUDNN_STATUS_SUCCESS ────────────────────────► Ok(())             │
│       │                                                                     │
│       ├─► Exact cuDNN 8 parent codes ──────────────────► Err(CudnnError…)  │
│       │   (NOT_INITIALIZED, BAD_PARAM, NOT_SUPPORTED …)                    │
│       │                                                                     │
│       ├─► New cuDNN 9 named codes ─────────────────────► Err(CudnnError…)  │
│       │   (SUBLIBRARY_VERSION_MISMATCH, DEPRECATED …)                      │
│       │                                                                     │
│       └─► Unknown cuDNN 9 sub-code  (was: todo!() 💥)                      │
│               │                                                             │
│               ▼                                                             │
│         category = raw_value / 1000 * 1000                                 │
│               │                                                             │
│         ┌─────┴──────────────────────────────────────┐                     │
│         │  2000 ──► CudnnError::BadParam              │                    │
│         │  3000 ──► CudnnError::NotSupported          │                    │
│         │  4000 ──► CudnnError::InternalError         │                    │
│         │  5000 ──► CudnnError::ExecutionFailed       │                    │
│         │  other ─► CudnnError::InternalError         │                    │
│         └────────────────────────────────────────────┘                     │
│               │                                                             │
│               └──────────────────────────────────────► Err(CudnnError…)   │
└─────────────────────────────────────────────────────────────────────────────┘
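The fallback arm boils down to integer division on the raw status value. A minimal sketch, assuming the raw status is available as a u32 (for example via an as cast on the bindgen type) and using a trimmed, hypothetical stand-in for CudnnError rather than the crate's real enum:

```rust
/// Hypothetical stand-in for the crate's CudnnError, trimmed to the
/// variants the fallback can produce.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum CudnnError {
    BadParam,
    NotSupported,
    InternalError,
    ExecutionFailed,
}

/// Map a status code that matched no explicit arm to its parent category.
/// Integer division drops the sub-code: 2002 / 1000 * 1000 == 2000.
fn fallback_for_unknown(raw: u32) -> CudnnError {
    match raw / 1000 * 1000 {
        2000 => CudnnError::BadParam,
        3000 => CudnnError::NotSupported,
        4000 => CudnnError::InternalError,
        5000 => CudnnError::ExecutionFailed,
        // Any other category (e.g. an unnamed 1xxx code) is reported as an
        // internal error instead of panicking.
        _ => CudnnError::InternalError,
    }
}
```

Because the explicit arms in into_result() are checked first, this function only ever sees codes that have no dedicated variant, so adding a named variant later never changes how existing named codes are reported.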

Round-trip correctness (CudnnError ↔ cudnnStatus_t)

into_raw() is also updated so the new variants can be converted back to their raw status codes: for each new named variant e, passing e.into_raw() through into_result() yields Err(e) again.

  ┌──────────────────────┐    into_raw()   ┌───────────────────────────────┐
  │  CudnnError (Rust)   │ ──────────────► │  cudnnStatus_t (C / FFI)      │
  │                      │                 │                               │
  │  SublibraryVersion…  │ ◄────────────── │  SUBLIBRARY_VERSION_MISMATCH  │
  │  Serialization…      │   into_result() │  SERIALIZATION_VERSION…       │
  │  Deprecated          │                 │  DEPRECATED                   │
  │  SublibraryLoading…  │                 │  SUBLIBRARY_LOADING_FAILED    │
  └──────────────────────┘                 └───────────────────────────────┘
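The round-trip property can be illustrated with a small self-contained model. The numeric constants below are illustrative placeholders standing in for the bindgen-generated cudnnStatus_t values (check them against cudnn_graph.h), and the enum and functions are simplified, hypothetical versions of the crate's CudnnError, into_raw() and into_result(), with the raw status modeled as a plain u32.

```rust
// Illustrative placeholder status values; in the crate these would come from
// the generated bindings, not hand-written literals.
const SUCCESS: u32 = 0;
const SUBLIBRARY_VERSION_MISMATCH: u32 = 1002;
const SERIALIZATION_VERSION_MISMATCH: u32 = 1003;
const DEPRECATED: u32 = 1004;
const SUBLIBRARY_LOADING_FAILED: u32 = 1008;
const INTERNAL_ERROR: u32 = 4000;

/// Simplified, hypothetical model of CudnnError.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum CudnnError {
    SublibraryVersionMismatch,
    SerializationVersionMismatch,
    Deprecated,
    SublibraryLoadingFailed,
    InternalError,
}

impl CudnnError {
    /// Rust error -> raw status code.
    fn into_raw(self) -> u32 {
        match self {
            CudnnError::SublibraryVersionMismatch => SUBLIBRARY_VERSION_MISMATCH,
            CudnnError::SerializationVersionMismatch => SERIALIZATION_VERSION_MISMATCH,
            CudnnError::Deprecated => DEPRECATED,
            CudnnError::SublibraryLoadingFailed => SUBLIBRARY_LOADING_FAILED,
            CudnnError::InternalError => INTERNAL_ERROR,
        }
    }
}

/// Raw status code -> Result. Named codes are matched before the catch-all,
/// so a code with its own variant is never lumped into InternalError.
fn into_result(raw: u32) -> Result<(), CudnnError> {
    match raw {
        SUCCESS => Ok(()),
        SUBLIBRARY_VERSION_MISMATCH => Err(CudnnError::SublibraryVersionMismatch),
        SERIALIZATION_VERSION_MISMATCH => Err(CudnnError::SerializationVersionMismatch),
        DEPRECATED => Err(CudnnError::Deprecated),
        SUBLIBRARY_LOADING_FAILED => Err(CudnnError::SublibraryLoadingFailed),
        // Simplified stand-in for the category-based fallback.
        _ => Err(CudnnError::InternalError),
    }
}
```

A round-trip test then only has to loop over every variant and check that into_result(e.into_raw()) == Err(e).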

Changes

  File                         What changed
  ══════════════════════════╦═══════════════════════════════════════════════
  crates/cudnn/src/error.rs ║  Add 4 new CudnnError variants behind #[cfg(cudnn9)]; replace _ => todo!() with a category-based fallback; wire the new variants into into_raw(); add unit tests

New CudnnError variants (cuDNN 9 only):

  • SublibraryVersionMismatch — sub-library version mismatch
  • SerializationVersionMismatch — serialisation version mismatch
  • Deprecated — deprecated API called
  • SublibraryLoadingFailed — required sub-library could not be loaded
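Gating variants behind #[cfg(cudnn9)] means every exhaustive match over CudnnError must gate the corresponding arms as well; otherwise cuDNN 8 builds fail with "no variant named ..." errors. A hypothetical sketch of that pattern, where the message() method and the trimmed variant list are illustrative and not the crate's API:

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum CudnnError {
    BadParam,
    NotSupported,
    // The variants below exist only when build.rs emitted `cargo::rustc-cfg=cudnn9`.
    #[cfg(cudnn9)]
    SublibraryVersionMismatch,
    #[cfg(cudnn9)]
    SerializationVersionMismatch,
    #[cfg(cudnn9)]
    Deprecated,
    #[cfg(cudnn9)]
    SublibraryLoadingFailed,
}

impl CudnnError {
    /// Hypothetical human-readable message; each gated variant needs a gated arm.
    fn message(self) -> &'static str {
        match self {
            CudnnError::BadParam => "invalid parameter",
            CudnnError::NotSupported => "operation not supported",
            #[cfg(cudnn9)]
            CudnnError::SublibraryVersionMismatch => "sub-library version mismatch",
            #[cfg(cudnn9)]
            CudnnError::SerializationVersionMismatch => "serialisation version mismatch",
            #[cfg(cudnn9)]
            CudnnError::Deprecated => "deprecated API called",
            #[cfg(cudnn9)]
            CudnnError::SublibraryLoadingFailed => "required sub-library could not be loaded",
        }
    }
}
```

On a cuDNN 8 build the gated variants and arms disappear together, so the match stays exhaustive in both configurations.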

Removed:

  • _ => todo!() wildcard panic — replaced with deterministic integer-division fallback

New unit tests:

Unit tests in crates/cudnn/src/error.rs — no GPU needed, run with cargo test -p cudnn:

  Test                                                   Aspect covered
  ════════════════════════════════════════════════════╦══════════════════════════════
  success_maps_to_ok                                  ║  CUDNN_STATUS_SUCCESS → Ok(())
  common_status_codes_map                             ║  All version-agnostic parent codes → correct CudnnError variant
  cudnn8_only_status_codes_map                        ║  cuDNN 8-only codes (AllocFailed, ArchMismatch, …) gated behind #[cfg(not(cudnn9))]
  cudnn9_named_status_codes_map                       ║  Four new cuDNN 9 named codes → correct new variants
  cudnn9_hierarchical_subcodes_map_to_parent_category ║  The core bug path: sub-codes like BAD_PARAM_NULL_POINTER (2002) and NOT_SUPPORTED_SHAPE (3xxx) map to their parent category instead of panicking
  cudnn9_into_raw_round_trips_for_named_errors        ║  CudnnError → into_raw() → into_result() round-trip for cuDNN 9 variants
  into_raw_round_trips_for_common_errors              ║  Same round-trip for version-agnostic variants
  into_raw_round_trips_for_cudnn8_only_errors         ║  Same round-trip for cuDNN 8-only variants

cuDNN 9 tests compile only when DEP_CUDNN_VERSION >= 90000. On cuDNN 8 builds, #[cfg(cudnn9)] skips them.

Run tests with cargo test -p cudnn -- --nocapture


Verification

  • Verified against cudnn_graph.h from cuDNN 9.20 (CUDA 13.2, Anaconda distribution on Windows 11)
  • The cudnn crate compiles cleanly in this configuration

Testing

  • cudnn crate compiles with cuDNN 9.20 on Windows 11 / CUDA 13.2
  • No todo!() remains in the #[cfg(cudnn9)] error-handling path
  • All cuDNN 8 code paths preserved under #[cfg(not(cudnn9))]
  • New named variants are wired in into_raw() for round-trip correctness

…r mapping

cuDNN 9 restructured cudnnStatus_t into a hierarchical numeric system
(2xxx=BAD_PARAM, 3xxx=NOT_SUPPORTED, 4xxx=INTERNAL_ERROR, 5xxx=EXECUTION_FAILED)
and removed several codes present in cuDNN 8.

Changes:
- Add four new CudnnError variants behind #[cfg(cudnn9)]:
  SublibraryVersionMismatch, SerializationVersionMismatch, Deprecated,
  SublibraryLoadingFailed
- Replace the _ => todo!() wildcard in IntoResult::into_result() with a
  category-based fallback that maps cuDNN 9 sub-codes (e.g. BAD_PARAM_NULL_POINTER)
  to their parent category variant using integer division, eliminating the
  runtime panic entirely
- Wire the new variants into into_raw() for round-trip correctness

Verified against cudnn_graph.h from cuDNN 9.20 (Anaconda distribution).
The cudnn crate itself compiles cleanly; only pre-existing cust bindgen
errors prevent a full cargo check -p cudnn from succeeding.

Made-with: Cursor
@CharryWu CharryWu force-pushed the fix/cudnn9-error-codes branch from a5216d2 to 268981e April 5, 2026 20:46
@CharryWu CharryWu marked this pull request as ready for review April 5, 2026 21:28
@CharryWu CharryWu requested a review from frjnn as a code owner April 5, 2026 21:28