
Add topology-derived sparse attention kernel #22

Open
teerthsharma wants to merge 8 commits into triton-lang:main from teerthsharma:teerthsharma/topological-sparse-attention


teerthsharma commented Jul 1, 2026


Summary

  • Add a forward-only Triton scheduled-attention kernel that consumes causal CSR block schedules.
  • Add topology-derived schedule construction using sink blocks, local-window blocks, and a 0D-persistence-style salience score over key-block centroids.
  • Support both [seq, dim] and [batch, heads, seq, dim] inputs with the same CSR schedule shared across batch/head lanes.
  • Preserve the input dtype for the attention output, and preserve the key tensor device for topology-derived schedules so CUDA schedules can be passed directly to the kernel.
  • Validate schedule-builder inputs and wrapper inputs before launch for block size, head dimension, CSR shape, integer schedule dtype, CUDA tensors, and schedule device placement.
  • Add a checked-in benchmark runner to reproduce the sparse-vs-dense-CSR table.
  • Add tests for public export, dense causal CSR layout, topology block selection, schedule validation, wrapper validation, schedule device preservation, benchmark formatting, dtype preservation, 2D correctness, and batched/headed correctness.
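
To make the CSR block-schedule idea above concrete, here is a minimal pure-Python sketch. The function names and the `num_sink` and `window` parameters are illustrative assumptions, not this PR's API, and the salience-selected blocks are left out. Row `q` of the schedule lists the key blocks that query block `q` visits, stored as a `row_ptr` / `col_idx` pair.

```python
def dense_causal_csr(num_blocks):
    """Dense causal schedule: query block q visits key blocks 0..q."""
    row_ptr, col_idx = [0], []
    for q in range(num_blocks):
        col_idx.extend(range(q + 1))
        row_ptr.append(len(col_idx))
    return row_ptr, col_idx


def sink_local_csr(num_blocks, num_sink=1, window=2):
    """Sparse causal schedule (hypothetical): sink blocks plus a local window.

    Each query block q visits the first `num_sink` key blocks and the last
    `window` key blocks ending at the diagonal, never a block beyond q.
    """
    row_ptr, col_idx = [0], []
    for q in range(num_blocks):
        sinks = set(range(min(num_sink, q + 1)))
        local = set(range(max(0, q - window + 1), q + 1))
        col_idx.extend(sorted(sinks | local))
        row_ptr.append(len(col_idx))
    return row_ptr, col_idx


dense_ptr, dense_cols = dense_causal_csr(16)
sparse_ptr, sparse_cols = sink_local_csr(16, num_sink=1, window=2)
print(len(dense_cols))   # 136 causal blocks for 16 query blocks
print(len(sparse_cols))  # 45 blocks with one sink block and a window of two
```

A kernel that consumes such a schedule would, for query block `q`, loop over `col_idx[row_ptr[q]:row_ptr[q + 1]]` only, which is why the dense schedule serves as a like-for-like baseline for the sparse one.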

Closes #21.

This was originally attempted in triton-lang/triton#10768. Maintainer feedback there said the Triton core repo was probably not the right place and that a repo of Triton kernels would be a better fit, so this PR moves the contribution here instead of reopening the core PR.

Local validation

wsl.exe -d Ubuntu -- bash -lc "cd /mnt/c/Users/seal/Documents/GitHub/kernels && /home/seal/.cache/codex-triton-topology-venv/bin/python -m py_compile benchmarking/__init__.py benchmarking/topology_sparse_attention.py kernels/topology_sparse_attention.py test/test_topology_sparse_attention.py && /home/seal/.cache/codex-triton-topology-venv/bin/python -m pytest -s --tb=short test/test_topology_sparse_attention.py"

Result: 17 passed.

git diff --check -- .

Result: no whitespace errors. Git emitted local CRLF warnings only.

Benchmark smoke command:

wsl.exe -d Ubuntu -- bash -lc "cd /mnt/c/Users/seal/Documents/GitHub/kernels && /home/seal/.cache/codex-triton-topology-venv/bin/python -m benchmarking.topology_sparse_attention --seq 1024 --rounds 3"

Result: command completed and printed the markdown benchmark table.

Local benchmark

To reproduce the full table:

python -m benchmarking.topology_sparse_attention --seq 1024 2048 4096 --rounds 50

Measured on NVIDIA GeForce RTX 4060 Laptop GPU, PyTorch 2.12.1+cu130, Triton 3.7.1, dim=64, block_size=64, 50 timing rounds.

| seq | scheduled / dense blocks | block reduction | dense masked ms | full causal SDPA ms | Triton dense CSR ms | Triton scheduled ms | Triton sparse vs dense CSR | Triton vs SDPA | max abs error |
|---|---|---|---|---|---|---|---|---|---|
| 1024 | 59 / 136 | 56.6% | 2.341 | 0.085 | 0.098 | 0.094 | 1.04x | 0.90x | 0.0009 |
| 2048 | 114 / 528 | 78.4% | 2.772 | 0.081 | 0.125 | 0.057 | 2.18x | 1.41x | 0.0008 |
| 4096 | 398 / 2080 | 80.9% | 10.192 | 0.186 | 0.242 | 0.070 | 3.48x | 2.68x | 0.0009 |

The main Triton-side comparison is Triton sparse vs dense CSR: both sides use this PR's scheduled-attention Triton kernel, but the dense CSR path visits every causal block. The SDPA number is a full-causal optimized baseline, not the same sparse mask; the 1024-token case is slower than SDPA, so this should not be read as a universal SDPA speedup claim.
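
As a quick sanity check on the block columns, the sketch below recomputes the dense block counts and the block-reduction percentages. It assumes the dense count is the number of causal blocks for `seq / block_size` query blocks, which is consistent with the values in the table. The helper names are illustrative and do not come from the benchmark runner.

```python
def dense_causal_blocks(seq_len, block_size):
    """Number of (query block, key block) pairs on or below the diagonal."""
    n = seq_len // block_size
    return n * (n + 1) // 2


def block_reduction(scheduled, dense):
    """Fraction of dense causal blocks that the sparse schedule skips."""
    return 1.0 - scheduled / dense


# Scheduled block counts are copied from the benchmark table (block_size=64).
for seq, scheduled in [(1024, 59), (2048, 114), (4096, 398)]:
    dense = dense_causal_blocks(seq, 64)
    print(seq, dense, f"{block_reduction(scheduled, dense):.1%}")
# 1024 136 56.6%
# 2048 528 78.4%
# 4096 2080 80.9%
```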

@teerthsharma

Author

Requesting review: this PR moves the earlier triton-lang/triton#10768 work here following the repo-fit feedback and now includes focused correctness/validation tests plus a checked-in benchmark runner. @ThomasRaoux, if there is a better reviewer for kernels examples, could you route this to the right maintainer?

@ThomasRaoux ThomasRaoux left a comment


looks good to me, just one comment

Comment thread benchmarking/__init__.py Outdated

teerthsharma commented Jul 1, 2026

Author

> looks good to me, just one comment

Thanks, fixed. I moved the optional pandas dependency handling into benchmark_utils.compare_benchmarks, where pandas is actually used, and restored benchmarking/__init__.py to the normal exports. I also added a regression test for importing benchmark_utils without pandas installed. Any other blockers? @ThomasRaoux, how can I move forward?

@ThomasRaoux


> > looks good to me, just one comment
>
> Thanks, fixed. I moved the optional pandas dependency handling into benchmark_utils.compare_benchmarks, where pandas is actually used, and restored benchmarking/__init__.py to the normal exports. I also added a regression test for importing benchmark_utils without pandas installed. Any other blockers? @ThomasRaoux, how can I move forward?

Should we just make pandas an explicit dependency? I thought that was already the case.


teerthsharma commented Jul 3, 2026

Author

> > > looks good to me, just one comment
> >
> > Thanks, fixed. I moved the optional pandas dependency handling into benchmark_utils.compare_benchmarks, where pandas is actually used, and restored benchmarking/__init__.py to the normal exports. I also added a regression test for importing benchmark_utils without pandas installed. Any other blockers? @ThomasRaoux, how can I move forward?
>
> Should we just make pandas an explicit dependency? I thought that was already the case.

I see two reasonable designs:

  1. Make pandas an explicit project dependency, since benchmarking.compare_benchmarks already depends on it. Then benchmarking/__init__.py can keep exporting compare_benchmarks directly with no lazy handling.

  2. Keep pandas optional and localize the import inside compare_benchmarks, so users who only import benchmarking.Profiler or the topology benchmark helpers do not need pandas installed.

I am currently working with the second design, but the first is also valid, since anyone running Triton most likely has pandas installed already.
I am fine switching to option 1 if you prefer pandas to be a required dependency for this repo. Which direction would you like, @ThomasRaoux? Personally, I prefer option 2 because it gives users more flexibility and freedom.
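
For reference, option 2 could look roughly like the sketch below. `require_optional` and `compare_benchmarks_sketch` are hypothetical names used for illustration, not the code in this PR. The point is that the import happens inside the function body, so importing the package never touches pandas.

```python
import importlib


def require_optional(module_name, feature):
    """Import `module_name` on demand, or raise an error naming the feature."""
    try:
        return importlib.import_module(module_name)
    except ImportError as exc:
        raise ImportError(
            f"{feature} requires the optional dependency '{module_name}'"
        ) from exc


def compare_benchmarks_sketch(rows):
    """Hypothetical stand-in: pandas is imported only when this is called."""
    pd = require_optional("pandas", "compare_benchmarks_sketch")
    return pd.DataFrame(rows)
```

With option 1, the `import pandas` line would instead sit at module top level and pandas would be listed in the project's dependencies, which is simpler but makes every import of the package require it.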


Successfully merging this pull request may close these issues.

Add topology-derived CSR scheduled sparse attention kernel
