This megakernel is aggressively optimized for Qwen3-0.6B (bf16) shapes to be run on an RTX 5090.
More details on this blogpost: https://blog.alpindale.net/posts/5090_decode_optimization/
| Backend | tok/s | ms/tok | Speedup |
|---|---|---|---|
| PyTorch (HF) | 123.3 | 8.11 | 1.00x |
| Megakernel | 1036.3 | 0.99 | 8.40x |
To use this:
uv pip install -r requirements.txt
python -m qwen_megakernel.benchNot tested on any other GPU, and likely won't run or work. Needs at least CUDA 12.8.
Based on Elliot Arledge's MegaQwen for the RTX 3090 GPU.