Skip to content

AlpinDale/qwen_megakernel

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

19 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

Qwen 0.6B Megakernel for RTX 5090

This megakernel is aggressively optimized for Qwen3-0.6B (bf16) shapes to be run on an RTX 5090.

More details on this blogpost: https://blog.alpindale.net/posts/5090_decode_optimization/

Backend tok/s ms/tok Speedup
PyTorch (HF) 123.3 8.11 1.00x
Megakernel 1036.3 0.99 8.40x

To use this:

uv pip install -r requirements.txt
python -m qwen_megakernel.bench

Not tested on any other GPU, and likely won't run or work. Needs at least CUDA 12.8.

Credits

Based on Elliot Arledge's MegaQwen for the RTX 3090 GPU.

About

Aggressive decode optimizations for Qwen3-0.6B on RTX 5090

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages