An experimental open-source project demonstrating compiler-driven kernel generation for AMD XDNA NPUs using Triton and MLIR-AIR.
Triton-XDNA provides an end-to-end compilation flow that lowers standard Triton kernels directly to AMD NPU hardware — no prebuilt kernel libraries required. It bridges Triton's high-level parallel programming model with AMD's MLIR-AIR/AIE compilation stack, producing XRT-compatible binaries for AMD AI Engine architectures (AIE2 and AIE2P).
Triton kernels are first lowered to compact Linalg compute graphs via triton-shared, then tiled and mapped onto parallel NPU cores using the MLIR Transform dialect, and finally compiled through MLIR-AIR and MLIR-AIE to produce device binaries.
```
Triton kernel (@triton.jit)
  -> triton-shared (Linalg)
  -> MLIR Transform dialect (tiling, bufferization, vectorization)
  -> MLIR-AIR / MLIR-AIE
  -> XRT binary (aie.xclbin)
```
- For dense matrix multiplication (I8/I16/BF16), compiler-generated kernels achieve performance parity with handwritten NPU implementations
- Over 90% of tested matmul configurations reach at least 90% of baseline throughput; no configuration falls below 80%
- Currently supports matrix multiplication, elementwise operations, softmax, and layer normalization
- Complex compute graphs with reductions and broadcasts are mapped onto parallel NPU tiles
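These kernels use Triton's blocked SPMD model: each program instance processes one block of the data, with a mask guarding the ragged tail. As a rough, dependency-light illustration — a NumPy emulation of what one program instance computes, not the real Triton API or the NPU flow — here is a blocked elementwise add:

```python
import numpy as np

BLOCK = 64  # elements handled by one program instance

def add_block(x, y, out, pid, block=BLOCK):
    # Mirrors tl.program_id(0) * BLOCK + tl.arange(0, BLOCK) with a tail mask.
    offsets = pid * block + np.arange(block)
    mask = offsets < x.shape[0]
    out[offsets[mask]] = x[offsets[mask]] + y[offsets[mask]]

n = 150  # deliberately not a multiple of BLOCK: the last block is masked
x = np.arange(n, dtype=np.float32)
y = np.ones(n, dtype=np.float32)
out = np.empty(n, dtype=np.float32)

# Launch grid sized by ceil-division, as triton.cdiv(n, BLOCK) would size it.
for pid in range((n + BLOCK - 1) // BLOCK):
    add_block(x, y, out, pid)

assert np.array_equal(out, x + y)
```

In the real flow, each such program instance is what gets tiled and mapped onto a parallel NPU core by the Transform-dialect schedule.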
This is an experimental project and we welcome community contributions. Whether it's adding support for new kernel types, improving performance, or extending platform support — we'd love to collaborate.
```bash
git clone https://github.com/amd/Triton-XDNA.git
cd Triton-XDNA
git submodule update --init
```
Please follow the instructions in the mlir-aie project for installing the XDNA driver.
The easiest way to get started is to install the pre-built wheel from GitHub Releases:
```bash
python3 -m venv sandbox
source sandbox/bin/activate
python3 -m pip install --upgrade pip

# Install triton-xdna from GitHub Releases
pip install triton-xdna \
  --find-links https://github.com/amd/Triton-XDNA/releases/expanded_assets/latest-wheels \
  --find-links https://github.com/Xilinx/mlir-aie/releases/expanded_assets/latest-wheels-no-rtti \
  --find-links https://github.com/Xilinx/llvm-aie/releases/expanded_assets/nightly \
  --find-links https://github.com/Xilinx/mlir-air/releases/expanded_assets/latest-air-wheels-no-rtti
```

Note: To install from a local wheel file:

```bash
pip install /path/to/triton_xdna-*.whl \
  --find-links https://github.com/Xilinx/mlir-aie/releases/expanded_assets/latest-wheels-no-rtti \
  --find-links https://github.com/Xilinx/llvm-aie/releases/expanded_assets/nightly \
  --find-links https://github.com/Xilinx/mlir-air/releases/expanded_assets/latest-air-wheels-no-rtti
```

Starting from the root of the repository:
```bash
python3 -m venv sandbox
source sandbox/bin/activate
python3 -m pip install --upgrade pip
pip install cmake pybind11 nanobind wheel ninja pytest setuptools Cython

# Install triton-xdna from source and all dependencies automatically
pip install . --no-build-isolation \
  --find-links https://github.com/Xilinx/mlir-aie/releases/expanded_assets/latest-wheels-no-rtti \
  --find-links https://github.com/Xilinx/llvm-aie/releases/expanded_assets/nightly \
  --find-links https://github.com/Xilinx/mlir-air/releases/expanded_assets/latest-air-wheels-no-rtti
```

This will automatically install all required dependencies:
- mlir-aie
- llvm-aie
- mlir-air
The mlir-air version is pinned in utils/mlir-air-hash.txt. The matching mlir-aie commit is pinned by the mlir-air wheel's [aie] extra, so it's resolved transitively. llvm-aie uses the latest nightly release.
```bash
python3 -m venv sandbox
source sandbox/bin/activate
python3 -m pip install --upgrade pip
pip install cmake pybind11 nanobind wheel ninja pytest setuptools Cython
source utils/env_setup.sh
cmake -GNinja -S . -Bbuild
cd build
ninja
```

CMake installs the C++ binaries under `third_party/triton/python/build`. A `triton` Python package with the new `amd_triton_npu` backend is also pip-installed into the `sandbox` virtual environment.
Make sure to run `source {path_to_xrt}/setup.sh` before running the examples. The tests also depend on PyTorch as the CPU reference.
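The reference check itself is conceptually simple; here is a dependency-light sketch using NumPy in place of PyTorch (the repo's tests use torch, and these tolerances are placeholders — BF16 accumulation on the NPU will not match float32 bit-for-bit):

```python
import numpy as np

def check_against_cpu_reference(npu_out, a, b, rtol=1e-2, atol=1e-2):
    """Compare NPU matmul output against a float32 CPU reference.
    Loose tolerances are placeholders for BF16 accumulation differences."""
    ref = a.astype(np.float32) @ b.astype(np.float32)
    return np.allclose(npu_out, ref, rtol=rtol, atol=atol)

rng = np.random.default_rng(0)
a = rng.standard_normal((64, 64), dtype=np.float32)
b = rng.standard_normal((64, 64), dtype=np.float32)
fake_npu_out = a @ b  # stand-in for the device result
assert check_against_cpu_reference(fake_npu_out, a, b)
```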
```bash
cd examples/matmul_bf16_m64_n64_k64
AIR_TRANSFORM_TILING_SCRIPT=transform_aie2.mlir python matmul_bf16_m64_n64_k64.py
```
Note: The transform_aie2.mlir transform dialect IR is specifically designed for the AIE2 architecture. For AIE2P architecture, use transform_aie2p.mlir instead.
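The script selection can be automated; a minimal sketch, assuming a hypothetical `NPU_GEN` variable naming the target generation (both filenames come from the examples above):

```shell
# NPU_GEN is a hypothetical variable for this sketch; default to AIE2.
NPU_GEN="${NPU_GEN:-aie2}"
case "$NPU_GEN" in
  aie2)  export AIR_TRANSFORM_TILING_SCRIPT=transform_aie2.mlir ;;
  aie2p) export AIR_TRANSFORM_TILING_SCRIPT=transform_aie2p.mlir ;;
  *)     echo "unknown NPU generation: $NPU_GEN" >&2; exit 1 ;;
esac
echo "$AIR_TRANSFORM_TILING_SCRIPT"
```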
Native Windows builds are supported using MSVC — no WSL or Linux required. The full compilation pipeline (Triton → MLIR → xclbin → XRT dispatch) runs natively on Windows.
- Windows 10/11 (x64)
- Visual Studio 2022 with "Desktop development with C++" workload
- Python 3.10, 3.11, or 3.12 (Xilinx Windows wheels do not yet support 3.13+)
- CMake 3.20+ and Ninja (via pip or standalone)
- AMD NPU driver (installs the `xrt_coreutil.dll` runtime)
```powershell
git clone https://github.com/amd/Triton-XDNA.git
cd Triton-XDNA
git submodule update --init
python -m venv venv
.\venv\Scripts\activate
pip install --upgrade pip setuptools wheel
```

Prepare the XRT development files (headers, import library, xclbinutil): download `xrt_windows_sdk.zip` from Xilinx/XRT releases and extract the inner `xrt_sdk/xrt/` directory (note the zip's top-level folder is `xrt_sdk/`) to `C:\Program Files\AMD\xrt`:

```powershell
# The contents of xrt_sdk/xrt/ inside the zip should end up at:
# C:\Program Files\AMD\xrt\include\xrt\xrt_bo.h
# C:\Program Files\AMD\xrt\lib\xrt_coreutil.lib
```

Run the automated environment setup (it must be dot-sourced so PATH/env vars persist in the current shell):
```powershell
. .\utils\env_setup.ps1
```

This installs the pre-built wheels (`triton-windows` and `mlir-air[aie]`, which transitively pulls `mlir-aie` and `llvm-aie`) and the Triton-XDNA backend.
Install build tools, PyTorch, and the MLIR-AIE/AIR/LLVM-AIE stack. The `mlir_air[aie]` extra transitively pins a matching `mlir-aie` and pulls `llvm-aie`, so a single resolver pass installs the whole stack from the Xilinx release pages:

```powershell
pip install cmake ninja lit numpy PyYAML nanobind scipy
pip install torch --index-url https://download.pytorch.org/whl/cpu
pip install triton-windows
pip install "mlir_air[aie]" `
  -f https://github.com/Xilinx/mlir-air/releases/expanded_assets/latest-air-wheels-no-rtti `
  -f https://github.com/Xilinx/mlir-aie/releases/expanded_assets/latest-wheels-no-rtti `
  -f https://github.com/Xilinx/llvm-aie/releases/expanded_assets/nightly
```

To pin a specific mlir-air version, use the values from `utils/mlir-air-hash.txt`: `mlir_air[aie]==<Version>.<Timestamp>+<short-commit>.no.rtti`.
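If you need that pinned requirement string programmatically (for example in CI), it can be assembled from the three components; the values below are illustrative placeholders, not real releases — read the actual ones from `utils/mlir-air-hash.txt`:

```python
def mlir_air_pin(version: str, timestamp: str, short_commit: str) -> str:
    """Format the documented pin:
    mlir_air[aie]==<Version>.<Timestamp>+<short-commit>.no.rtti"""
    return f"mlir_air[aie]=={version}.{timestamp}+{short_commit}.no.rtti"

# Placeholder values for illustration only.
print(mlir_air_pin("0.0.1", "2025010100", "abc1234"))
# mlir_air[aie]==0.0.1.2025010100+abc1234.no.rtti
```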
Install Triton-XDNA:

```powershell
$env:TRITON_PLUGIN_DIRS = "$PWD\third_party\triton_shared;$PWD\amd_triton_npu"
pip install -e . --no-build-isolation -v
```

`xclbinutil` and `aiebu-asm`: included in the XRT Windows SDK zip. Ensure they are on PATH or in `<mlir_aie_install>/bin/`.

DIA SDK: if the mlir-air CMake build can't find the DIA SDK:

```powershell
subst Z: "C:\Program Files\Microsoft Visual Studio\2022\Community\DIA SDK"
```

Run an example:

```powershell
cd examples\vec-add
$env:AIR_TRANSFORM_TILING_SCRIPT = "transform_aie2p.mlir"
python vec-add.py
```

| Variable | Purpose |
|---|---|
| `AIR_TRANSFORM_TILING_SCRIPT` | Path to the MLIR transform dialect IR |
| `XILINX_XRT` | (Optional) Override the XRT SDK location if not in `C:\Program Files\AMD\xrt` |
- Python 3.10, 3.11, and 3.12 only: Xilinx does not yet publish `mlir-air`/`mlir-aie` Windows wheels for 3.13+
- `xclbinutil` and `aiebu-asm` must be on PATH (from the XRT Windows SDK)
- NPU driver must be installed