Add topology-derived sparse attention kernel by teerthsharma · Pull Request #22 · triton-lang/kernels

teerthsharma · 2026-07-01T19:39:27Z

Summary

Add a forward-only Triton scheduled-attention kernel that consumes causal CSR block schedules.
Add topology-derived schedule construction using sink blocks, local-window blocks, and a 0D-persistence-style salience score over key-block centroids.
Support both [seq, dim] and [batch, heads, seq, dim] inputs with the same CSR schedule shared across batch/head lanes.
Preserve the input dtype for the attention output, and preserve the key tensor device for topology-derived schedules so CUDA schedules can be passed directly to the kernel.
Validate schedule-builder inputs and wrapper inputs before launch for block size, head dimension, CSR shape, integer schedule dtype, CUDA tensors, and schedule device placement.
Add a checked-in benchmark runner to reproduce the sparse-vs-dense-CSR table.
Add tests for public export, dense causal CSR layout, topology block selection, schedule validation, wrapper validation, schedule device preservation, benchmark formatting, dtype preservation, 2D correctness, and batched/headed correctness.
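As a rough illustration of the causal CSR block schedules listed above, the sketch below builds a dense causal schedule and a sink-plus-local-window schedule in plain Python. The function names, the `num_sink_blocks` and `window_blocks` parameters, and the `(row_ptr, col_idx)` layout are assumptions made for this example and are not taken from the PR's code. The persistence-style salience term is left out entirely.

```python
def dense_causal_csr(num_blocks):
    """CSR schedule in which query block q visits key blocks 0..q."""
    row_ptr, col_idx = [0], []
    for q in range(num_blocks):
        col_idx.extend(range(q + 1))
        row_ptr.append(len(col_idx))
    return row_ptr, col_idx


def sink_window_csr(num_blocks, num_sink_blocks=1, window_blocks=2):
    """Causal CSR schedule that keeps only sink blocks and a local window.

    Hypothetical parameters: the first `num_sink_blocks` key blocks are
    always visited, plus the `window_blocks` most recent key blocks
    (including the diagonal block).
    """
    row_ptr, col_idx = [0], []
    for q in range(num_blocks):
        sinks = range(min(num_sink_blocks, q + 1))
        window = range(max(0, q - window_blocks + 1), q + 1)
        # Merge and sort so each row's column indices are unique and ascending.
        col_idx.extend(sorted(set(sinks) | set(window)))
        row_ptr.append(len(col_idx))
    return row_ptr, col_idx


dense_ptr, dense_cols = dense_causal_csr(16)   # 1024 tokens / block_size 64
sparse_ptr, sparse_cols = sink_window_csr(4)   # tiny example, 4 blocks
```

With 16 blocks the dense schedule has 136 entries, which agrees with the dense block count reported for `seq=1024` in the benchmark table. Because the diagonal block is always in the window, every query row keeps at least one visible key.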

Closes #21.

This was originally attempted in triton-lang/triton#10768. Maintainer feedback there said the Triton core repo was probably not the right place and that a repo of Triton kernels would be a better fit, so this PR moves the contribution here instead of reopening the core PR.

Local validation

wsl.exe -d Ubuntu -- bash -lc "cd /mnt/c/Users/seal/Documents/GitHub/kernels && /home/seal/.cache/codex-triton-topology-venv/bin/python -m py_compile benchmarking/__init__.py benchmarking/topology_sparse_attention.py kernels/topology_sparse_attention.py test/test_topology_sparse_attention.py && /home/seal/.cache/codex-triton-topology-venv/bin/python -m pytest -s --tb=short test/test_topology_sparse_attention.py"

Result: 17 passed.

git diff --check -- .

Result: no whitespace errors. Git emitted local CRLF warnings only.

Benchmark smoke command:

wsl.exe -d Ubuntu -- bash -lc "cd /mnt/c/Users/seal/Documents/GitHub/kernels && /home/seal/.cache/codex-triton-topology-venv/bin/python -m benchmarking.topology_sparse_attention --seq 1024 --rounds 3"

Result: command completed and printed the markdown benchmark table.

Local benchmark

To reproduce the full table:

python -m benchmarking.topology_sparse_attention --seq 1024 2048 4096 --rounds 50

Measured on NVIDIA GeForce RTX 4060 Laptop GPU, PyTorch 2.12.1+cu130, Triton 3.7.1, dim=64, block_size=64, 50 timing rounds.

| seq | scheduled / dense blocks | block reduction | dense masked (ms) | full causal SDPA (ms) | Triton dense CSR (ms) | Triton scheduled (ms) | Triton sparse vs dense CSR (speedup) | Triton vs SDPA (speedup) | max abs error |
|---|---|---|---|---|---|---|---|---|---|
| 1024 | 59 / 136 | 56.6% | 2.341 | 0.085 | 0.098 | 0.094 | 1.04x | 0.90x | 0.0009 |
| 2048 | 114 / 528 | 78.4% | 2.772 | 0.081 | 0.125 | 0.057 | 2.18x | 1.41x | 0.0008 |
| 4096 | 398 / 2080 | 80.9% | 10.192 | 0.186 | 0.242 | 0.070 | 3.48x | 2.68x | 0.0009 |
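The block-count columns in the table can be checked with a few lines of arithmetic. The formulas below are inferred from the reported numbers (a lower-triangular count of 64-token blocks, and reduction as one minus the scheduled fraction). They are an assumption about how the benchmark runner derives these columns, not a copy of its code.

```python
def dense_causal_blocks(seq, block_size=64):
    """Number of (query block, key block) pairs on or below the diagonal."""
    n = seq // block_size
    return n * (n + 1) // 2


def block_reduction(scheduled, dense):
    """Fraction of causal blocks that the sparse schedule skips."""
    return 1.0 - scheduled / dense


# Scheduled block counts copied from the table above.
scheduled_blocks = {1024: 59, 2048: 114, 4096: 398}

summary = []
for seq, scheduled in scheduled_blocks.items():
    dense = dense_causal_blocks(seq)
    summary.append((seq, dense, f"{block_reduction(scheduled, dense):.1%}"))

for row in summary:
    print(*row)
```

This reproduces the 136, 528, and 2080 dense block counts and the 56.6%, 78.4%, and 80.9% reductions shown in the table.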

The main Triton-side comparison is Triton sparse vs dense CSR: both sides use this PR's scheduled-attention Triton kernel, but the dense CSR path visits every causal block. The SDPA number is a full-causal optimized baseline, not the same sparse mask; the 1024-token case is slower than SDPA, so this should not be read as a universal SDPA speedup claim.
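To make the dense-CSR baseline concrete, here is a small pure-Python reference. It is not the Triton kernel: it is a sketch of forward-only scheduled attention with a streaming softmax, using an assumed `(row_ptr, col_idx)` layout. It shows that a schedule which visits every causal block gives the same result as ordinary full causal attention. All names are illustrative.

```python
import math
import random


def scheduled_attention(q, k, v, row_ptr, col_idx, block_size):
    """Query block qb visits key blocks col_idx[row_ptr[qb]:row_ptr[qb + 1]]."""
    dim = len(q[0])
    scale = 1.0 / math.sqrt(dim)
    out = []
    for i, q_row in enumerate(q):
        qb = i // block_size
        running_max, denom, acc = -math.inf, 0.0, [0.0] * dim
        for kb in col_idx[row_ptr[qb]:row_ptr[qb + 1]]:
            # Causal mask inside the block: keys past position i are skipped.
            for j in range(kb * block_size, min((kb + 1) * block_size, i + 1)):
                score = scale * sum(a * b for a, b in zip(q_row, k[j]))
                new_max = max(running_max, score)
                rescale = math.exp(running_max - new_max)  # 0.0 on the first key
                weight = math.exp(score - new_max)
                denom = denom * rescale + weight
                acc = [x * rescale + weight * y for x, y in zip(acc, v[j])]
                running_max = new_max
        # Assumes each query row sees at least one key (diagonal block scheduled).
        out.append([x / denom for x in acc])
    return out


def full_causal_attention(q, k, v):
    """Plain causal softmax attention, used only as a comparison point."""
    dim = len(q[0])
    scale = 1.0 / math.sqrt(dim)
    out = []
    for i, q_row in enumerate(q):
        scores = [scale * sum(a * b for a, b in zip(q_row, k[j])) for j in range(i + 1)]
        top = max(scores)
        weights = [math.exp(s - top) for s in scores]
        total = sum(weights)
        out.append([sum(w * v[j][d] for j, w in enumerate(weights)) / total
                    for d in range(dim)])
    return out


random.seed(0)
seq, dim, block_size = 8, 4, 2
q, k, v = ([[random.gauss(0, 1) for _ in range(dim)] for _ in range(seq)]
           for _ in range(3))

# Dense causal CSR for 4 blocks: row b lists key blocks 0..b.
row_ptr = [0, 1, 3, 6, 10]
col_idx = [0, 0, 1, 0, 1, 2, 0, 1, 2, 3]

dense_out = scheduled_attention(q, k, v, row_ptr, col_idx, block_size)
full_out = full_causal_attention(q, k, v)
max_err = max(abs(a - b) for ra, rb in zip(dense_out, full_out)
              for a, b in zip(ra, rb))
```

Dropping entries from `col_idx` (and adjusting `row_ptr`) turns the same loop into the sparse path, which is why the sparse-vs-dense-CSR column isolates the effect of the schedule rather than of the kernel implementation.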

teerthsharma · 2026-07-01T20:08:12Z

Requesting review: this PR moves the earlier triton-lang/triton#10768 work here following the repo-fit feedback and now includes focused correctness/validation tests plus a checked-in benchmark runner. @ThomasRaoux, if there is a better reviewer for kernels examples, could you route this to the right maintainer?

ThomasRaoux

looks good to me, just one comment

teerthsharma · 2026-07-01T21:09:00Z

Thanks, fixed. I moved the optional pandas dependency handling into benchmark_utils.compare_benchmarks, where pandas is actually used, and restored benchmarking/__init__.py to the normal exports. I also added a regression test for importing benchmark_utils without pandas installed. Are there any other blockers? @ThomasRaoux, how can I move forward?

ThomasRaoux · 2026-07-03T00:11:23Z

Should we just make pandas an explicit dependency? I thought that was already the case.

teerthsharma · 2026-07-03T15:10:46Z

I see two reasonable designs:

1. Make pandas an explicit project dependency, since benchmarking.compare_benchmarks already depends on it. Then benchmarking/__init__.py can keep exporting compare_benchmarks directly with no lazy handling.
2. Keep pandas optional and localize the import inside compare_benchmarks, so users who only import benchmarking.Profiler or the topology benchmark helpers do not need pandas installed.

I am currently planning around the second design, but the first is also valid, since anyone running Triton will most likely have pandas installed anyway.
I am fine switching to option 1 if you prefer pandas to be a required dependency for this repo. Which direction would you like, @ThomasRaoux? Personally, I prefer option 2 because it gives users more flexibility.
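For readers unfamiliar with the second design, a minimal sketch of a function-local optional import follows. `compare_benchmarks_sketch` is a hypothetical stand-in, not the repo's `compare_benchmarks`, and the error message is made up for illustration. The helper after it shows one way a regression test might simulate a missing pandas: mapping the module name to `None` in `sys.modules` makes the import raise, whether or not pandas is really installed.

```python
import importlib
import sys
from unittest import mock


def compare_benchmarks_sketch(rows):
    """Hypothetical helper: pandas is imported only when this function runs."""
    try:
        pd = importlib.import_module("pandas")
    except ImportError as exc:
        raise ImportError("compare_benchmarks_sketch needs pandas installed") from exc
    return pd.DataFrame(rows)


def call_fails_without_pandas():
    """Simulate a missing pandas by mapping its sys.modules entry to None."""
    with mock.patch.dict(sys.modules, {"pandas": None}):
        try:
            compare_benchmarks_sketch([{"seq": 1024, "ms": 0.094}])
        except ImportError:
            return True
    return False


blocked = call_fails_without_pandas()
```

Defining the function never touches pandas, so code that only imports the module keeps working. Under option 1 the `try`/`except` would disappear and the import would move to module level.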

Add topological sparse attention kernel

3be0b36

teerthsharma mentioned this pull request Jul 1, 2026

Topology-derived CSR block schedule microbenchmark for sparse attention triton-lang/triton#10767

Closed

Preserve sparse attention output dtype

488d448

teerthsharma mentioned this pull request Jul 1, 2026

[Testing] Add topology-derived sparse attention microbenchmark triton-lang/triton#10768

Closed

5 tasks

teerthsharma added 5 commits July 2, 2026 01:20

Validate sparse attention inputs

7eb1c15

Add sparse attention benchmark runner

770c578

Validate sparse attention schedules

46b3763

Validate sparse attention schedule dtypes

beb58fd

Preserve topology schedule device

c6ed3a9

ThomasRaoux reviewed Jul 1, 2026

Comment thread benchmarking/__init__.py Outdated

Defer optional pandas benchmark dependency

f87e90a

teerthsharma wants to merge 8 commits into triton-lang:main from teerthsharma:teerthsharma/topological-sparse-attention
