This file provides guidance to coding agents (Claude Code, Codex) when working with code in this repository.
QuACK (Quirky Assortment of CuTe Kernels) — high-performance CUDA kernels written in CuTe-DSL, targeting H100 (SM90) and B200/B300 (SM100) GPUs. Package name: quack-kernels.
```bash
# Install (dev)
pip install -e '.[dev]'
pre-commit install

# For CUDA 13.1
pip install -e '.[dev,cu13]' --extra-index-url https://download.pytorch.org/whl/cu130

# Lint & format
ruff check --fix quack/ tests/
ruff format quack/ tests/

# Run all tests
pytest tests/

# Run a single test
pytest tests/test_rmsnorm.py -x
pytest tests/test_rmsnorm.py::test_rmsnorm_fwd -x -k "bfloat16"
```

For code inside `@cute.jit` or `@cute.kernel` decorated functions, only a subset of Python syntax is supported. Follow the CuTe-DSL Control Flow and Limitations docs.
Key rules (a sketch follows the list):
- `cutlass.const_expr()` marks compile-time constants; `cutlass.range_constexpr()` unrolls loops at compile time; `cutlass.range()` is for dynamic runtime loops
- No early `break`/`continue` in loops; no `return` values from jit functions
- Python lists/dicts inside DSL are static (compile-time only)
- Types must be determinable at compile time (no dependent types)
- Variables defined inside control flow bodies are not accessible outside
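A minimal sketch of these rules in action, assuming a toy jit function (the function, its arguments, and the printf format are illustrative, not repo code):

```python
import cutlass
import cutlass.cute as cute


# Toy example of the DSL control-flow rules above; illustrative only.
@cute.jit
def control_flow_demo(x: cutlass.Int32, n: cutlass.Constexpr):
    total = cutlass.Int32(0)  # defined outside control flow so it stays visible after it
    if cutlass.const_expr(n > 4):  # compile-time branch, resolved during tracing
        for i in cutlass.range_constexpr(n):  # unrolled at compile time; i is a Python int
            total = total + i
    else:
        for _ in cutlass.range(x):  # dynamic runtime loop; no break/continue allowed
            total = total + 1
    cute.printf("total = {}\n", total)  # jit functions cannot return values; print instead
```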
Reduction kernels (`rmsnorm.py`, `softmax.py`, `cross_entropy.py`) inherit from `ReductionBase` in `reduction_base.py`. They share a pattern: configure cluster size, get tiled copies, allocate reduction buffers with mbarriers, then launch a `@cute.kernel`.
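A minimal sketch of that host/device split, with placeholder grid/block sizes and an empty kernel body (assumptions, not the repo's actual configuration; see `reduction_base.py` for the real setup):

```python
import cutlass.cute as cute


@cute.kernel
def _rowwise_kernel(mX: cute.Tensor, mO: cute.Tensor):
    # Device side: the real kernels stage tiles through shared memory with
    # tiled copies and synchronize partial reductions via mbarriers.
    ...


@cute.jit
def launch_rowwise(mX: cute.Tensor, mO: cute.Tensor):
    # Host side: the real classes derive cluster size, tiled copies, and
    # reduction-buffer layouts here before launching.
    _rowwise_kernel(mX, mO).launch(grid=[1, 1, 1], block=[128, 1, 1])
```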
GEMM has a multi-layer design (a dispatch sketch follows this list):
- `gemm.py` — public API, validates inputs, selects SM version, caches compiled kernels
- `gemm_interface.py` — unified interface across SM versions
- `gemm_sm90.py` / `gemm_sm100.py` — SM-specific implementations
- `gemm_default_epi.py` + `gemm_*_epi.py` — epilogue variants (bias, activation, etc.)
- `gemm_config.py` — `GemmConfig` dataclass with tile sizes, cluster dims, swizzle settings
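The public layer's dispatch-and-cache pattern looks roughly like this (every name below is hypothetical; `gemm.py` holds the real logic):

```python
import functools

import torch


@functools.cache
def _compiled_kernel(sm: int, key: tuple):
    # Hypothetical cache: compile once per (SM version, shape/dtype key), reuse after.
    ...


def gemm_sketch(a: torch.Tensor, b: torch.Tensor) -> torch.Tensor:
    assert a.is_cuda and a.dtype == b.dtype  # validate inputs
    major, minor = torch.cuda.get_device_capability(a.device)
    sm = major * 10 + minor  # 90 on H100, 100 on B200/B300
    kernel = _compiled_kernel(sm, (a.dtype, tuple(a.shape), tuple(b.shape)))
    ...  # run kernel, return output
```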
Shared utilities:
- `copy_utils.py` — memory copy operations (shared↔register, async copies, tiled copies)
- `layout_utils.py` — layout algebra (transpose, select, expand, permute)
- `cute_dsl_utils.py` — dtype mapping, device capability queries, parameter base classes
- `tile_scheduler.py` — tile scheduling for persistent kernels
- `varlen_utils.py` — variable-length sequence support
Tests use pytest with parametrize across dtypes (float32, float16, bfloat16), dimensions, and batch sizes. Each test includes a reference implementation for numerical validation.
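A representative test shape (hedged: `quack.rmsnorm`, the eps, and the tolerances are assumptions; see `tests/test_rmsnorm.py` for the real entry point):

```python
import pytest
import torch

import quack  # assumed import; check tests/ for the actual one


@pytest.mark.parametrize("dtype", [torch.float32, torch.float16, torch.bfloat16])
@pytest.mark.parametrize("M,N", [(1, 256), (128, 4096)])
def test_rmsnorm_matches_reference(dtype, M, N):
    x = torch.randn(M, N, device="cuda", dtype=dtype)
    w = torch.randn(N, device="cuda", dtype=dtype)
    # Reference implementation in float32 for numerical headroom.
    xf = x.float()
    ref = xf * torch.rsqrt(xf.pow(2).mean(-1, keepdim=True) + 1e-6) * w.float()
    out = quack.rmsnorm(x, w)  # hypothetical API
    torch.testing.assert_close(out.float(), ref, rtol=1e-2, atol=1e-2)
```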
When iterating on kernel code, run a small subset of tests (1-3 parametrizations) rather than the full test suite. Use -k or pass specific test IDs to pytest. Only run the full suite when finalizing changes.
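For example (the test IDs below are illustrative):

```bash
# A few parametrizations, stopping at the first failure
pytest tests/test_rmsnorm.py -x -k "bfloat16 and 4096"
# Or a single exact test ID (copy it from pytest --collect-only)
pytest "tests/test_rmsnorm.py::test_rmsnorm_fwd[bfloat16-4096]" -x
```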
When debugging any failure — kernel correctness, torch.compile interaction, test infrastructure — get to the bottom of it. The goal is not to make the test pass; it is to understand why it fails and fix the actual cause.
Do not route around the bug just to make a test pass, for example by pruning a config, skipping a path, resetting state, increasing limits, or switching to a different implementation. A workaround that makes CI green but leaves the bug for users is worse than useless — it hides the problem and the deeper issue will eventually surface at greater cost. Only use workarounds after proving the root cause is external (e.g., upstream PyTorch bug) and documenting why.
Start by reproducing the reported failure, then simplify it to the smallest setting that still fails: reduce batch, M/N/K, tile shape, scheduler options, swap_ab, beta/C, dtype, and epilogue features where possible. Keep the failing behavior intact while removing unrelated complexity.
Use `cute.printf` inside `@cute.jit` / `@cute.kernel` code to print the relevant locations, tile coordinates, tensor coordinates, predicates, and values. Print at the boundaries between stages, for example TMA load, MMA accumulator, epilogue register values, register-to-smem, smem contents, and TMA store coordinates, until the first bad stage is identified.
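A minimal probe sketch (the tensor argument and format string are illustrative; `cute.arch.thread_idx()`-style index helpers are assumed here):

```python
import cutlass.cute as cute


@cute.kernel
def probe_kernel(mAcc: cute.Tensor):
    # Print one thread's view at a stage boundary; move the probe between
    # stages (TMA load, MMA accumulator, epilogue, store) to bisect failures.
    tidx, _, _ = cute.arch.thread_idx()
    bidx, bidy, _ = cute.arch.block_idx()
    if tidx == 0:
        cute.printf("block=({}, {}) acc[0] = {}\n", bidx, bidy, mAcc[0])
```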
After finding a fix, verify that the minimized repro passes, the original repro passes, and that temporarily disabling the fix makes the regression test fail. Regression tests should encode the failure mode, not only the high-level symptom.
- Favor concise, self-explanatory code
- Line length: 100 (ruff)
- Ruff allows: lambda assignment (E731), single-char vars I/O/l (E741), unused locals (F841)