Add PL + host code generation for on-board hardware flow by kmhatre14 · Pull Request #4 · dimdano/aie4ml

kmhatre14 · 2026-06-16T05:59:17Z

Adds an opt-in hardware emission layer to the AIE backend, enabling end-to-end execution on the VEK280 board (and hw emulation) alongside the existing aiesim flow.

Includes a hardware-flow tutorial tutorial_3.py in tutorials
Details about the change and execution are in docs/README_PL_integration.md.

- Added PL kernels for copying data fom DRAM to PL to AIE - Added host.cpp for hardware execution - Updated the Makefile to support hw and hw_emu execution

- Added a knob for PL memory selection (URAM vs BRAM)

dimdano · 2026-06-17T16:10:38Z

Thank you for the work. The PL integration is a useful direction and follows the same broad pattern as other AMD projects: AIE graph + PL data movers + system.cfg connectivity + XRT host. Before merging, I think this should be tightened into a strict system-emission layer over the existing aie4ml physical plan. Right now some logic is specialized to the current dense/int8 benchmark and bypasses contracts that already exist in IOPortLayout, artifact emission, and the physical plan. The main fixes I would like are generic RTP artifact loading, dtype-aware boundary packing, deriving all graph IO from build_io_layout(), optional correctness checking and probably making the PL timer/preload behavior an explicit/configurable mode rather than the only system backend behavior.

dimdano · 2026-06-17T16:13:31Z

+        artifacts = inst.variant.get_artifacts(inst)
+        weights = next((a for a in artifacts if a.get('name') == 'weights'), None)
+        if weights is None:
+            continue  # param-less op (no kernel-resident RTP weights)
+        bias = next((a for a in artifacts if a.get('name') == 'bias'), None)


maybe need to get any RTP artifact returned by variant.get_artifacts(), using the same generic contract as app.cpp.jinja. Dense happens to use weights/bias, but LayerNorm uses gamma/beta and future ops may expose LUTs or other parameter tables.

Addressed this in the commit: 29c9060

dimdano · 2026-06-17T16:15:06Z

+#include "../src/weights/{{ L.weights_header }}"
+#include "../src/weights/{{ L.bias_header }}"


any future artifacts must be handled, not only weights/bias

Addressed this in the commit: 29c9060

dimdano · 2026-06-17T16:17:07Z

+        'in_feat': in_feat,
+        'out_feat': out_feat,
+        'in_feat_slice': in_feat // n_ifm,
+        'out_feat_slice': out_feat // cas_num,
+        'cas_num': cas_num,


in this file graph boundary sizes, offsets, dtypes, and PLIO port counts should come from build_io_layout()/IOPortLayout and the physical plan. not inferred

Addressed in 4ba1fa0

dimdano · 2026-06-17T16:24:53Z

+    tensors = list(ports_map)
+    if len(tensors) != 1:
+        raise NotImplementedError(
+            f'system I/O plan supports a single graph {direction} tensor; '
+            f'got {len(tensors)} ({tensors}). Multiple graph {direction}s are not yet supported.'
+        )


Multiple PLIO ports for one tensor are required and supported here. but we might have multiple logical graph inputs/outputs which is unsupported. That is fine for an initial version, but this limitation should be documented clearly. The problem is that the failure is scattered and not expressed as a single explicit hardware boundary contract (i.e., remove duplicated checks from _single_io_feat() and pack_host_data(), etc.).

Removed the duplicate checks and moved it to the start of build_system_io()
Addressed in 4ba1fa0

dimdano · 2026-06-17T16:26:42Z

+    """Round-robin pack per-stream int8 tiles into the data mover's 512-bit DDR layout.
+
+    Returns a little-endian uint32 array (the host buffer element type). Mirrors the data
+    mover: contiguous DDR word i is owned by stream i % n_streams.
+    """
+    word_blocks = []
+    for tile in port_tiles:
+        flat = np.ascontiguousarray(tile).astype(np.int8).reshape(-1)


This silently narrows all graph-boundary data to int8. The PL data mover can remain 128-bit and type-agnostic; probably better to use raw storage bytes from IOPortLayout.dtype and validate alignment. If only int8 is supported in this PR it is fine, just fail hard.

Removed the int8 support while making it support raw storage bytes e04c1e9

dimdano · 2026-06-17T16:29:08Z

+    in_tensor = next(iter(layout.inputs))
+    in_ports = layout.inputs[in_tensor]
+    out_ports = layout.outputs[next(iter(layout.outputs))]
+    batch, in_feat = in_ports[0].numpy_boundary_shape


here also avoid assuming we have 2d inputs. maybe numpy_boundary_shape is useful here.

Used numpy_boundary_shape : e04c1e9

dimdano · 2026-06-17T16:30:05Z

+    body = ', '.join(str(int(v) & 0xFFFFFFFF) for v in values)
+    return f'static const uint32_t {name}[{size_macro}] = {{ {body} }};\n'
+
+# Direct method to write host/data.h TODO: Should be moved to jinja template


Agree with this TODO. For consistency with the rest of the writer, data.h should be emitted via a small jinja template, while Python only prepares the packed values.

data.h now emitted from a Jinja template e04c1e9

dimdano · 2026-06-17T16:31:41Z

+// Compile-time cap on n_iter -- sizes the local URAM buffers. This design
+// preloads ALL n_iter iters on-chip; MAX_N_ITER bounds the URAM footprint.


useful for benchmarking but we need something more scalable as the default deployment data mover. For deployment-style flows, especially where input may come from upstream PL/optical links can we use a streaming or double-buffered MM2S/S2MM-style mover?

This part has been changed drastically

Added MM2S and S2MM kernels in the memory_stream deployment mode f76a86d

MAX iter for benchmark mode is derived from input and output size : af897a5

dimdano · 2026-06-17T16:32:56Z

+    hls::stream<bool, 2> trig_compute_done;
+    hls::stream<bool, 2> stop_sig;
+
+    tick_gen(trig_preload_done, trig_first_send, trig_last_send,


maybe make it as an optional instrumentation? A config like EnablePLTiming for example

Added this as an optional instrument: ca94cb8

dimdano · 2026-06-17T16:36:56Z

+GCC_FLAGS   := -Wall -c -std=c++17 -I$(SDKTARGETSYSROOT)/usr/include/xrt
+GCC_LDFLAGS := -lxrt_coreutil -std=c++17


do we need to pass somewhere --sysroot=${SDKTARGETSYSROOT} ?

Not if you source

$XILINX_VERSAL/environment-setup-cortexa72-cortexa53-amd-linux

This is a part of the common image files from AMD.
You can set it if you want to point to some different $SDKTARGETSYSROOT

dimdano · 2026-06-17T16:38:53Z

also ONNX frontend needs a clean way to set these fields as well, maybe through AIEConfig

Fixed in ef49e3b

The extra config variables are

`
'Target': 'hardware',

'PLMemory': 'uram',

'EnablePLTiming': False,

'PLDataMoverMode': 'benchmark',
`

- Removed the dense layer specific RPT loading

PR 1: Fix the RTP artifact sourcing in host generation

- data-type agnostic - Uses the full n-D from numpy_boundry_shape - The current implementation just strores random values as input and does not have a way to pass user based inputs. As furture work this has to be changes by providing input as a argument for host.exe

PR1: data.h generated using jinja flow.

- Cleaned up the timing measurement prints

PR1: Added a configuration for PL timing. Its disabled by default

- io attributes used from build_io_layout and not infered - removed the duplicate checks for Multi graph intput

- Added separate HLS top level funciton for send (MM2S) and receive (S2MM) - No preloading in deployment - Double buffering enabled by default - PL timing measurement is made as an option for deployment

- Removed redundant code. - Separated Makefile generation for AIE vs System

- The layout dictionary was not used properly resulting in an error during execution for models with multiple layers

… mode. Its only applicable in benchmark mode

Merge Pr1/benchmark vs deploy to main

Adding PL related knobs to ONNX backend Below are the new knobs added to the config generated for ONNX using build_bit_exact_In_add_matmul_model.py def build_config() -> dict: return { 'Part': PART, 'AIEConfig': { 'BatchSize': BATCH, 'Iterations': ITERATIONS, 'Target': TARGET, 'PLMemory': PL_MEMORY, 'EnablePLTiming': ENABLE_PL_TIMING, 'PLDataMoverMode': PL_DATA_MOVER_MODE, }, 'LayerDirectives': LAYER_DIRECTIVES, }

…aie4ml into pr1/uram_size_formula

…ormula

MAX iterations are now derived from input and output size

- PL memory calculation is block based (BRAM/URAM block) - Previous calculation was size based and did not consider the block split required for parallel access

- PL memory budget of 80% is considered for estimation to avoid overutilization. - Fixed the n_ifm and n_ofm was passed incorrectly to the jinja template

Introduce external_stream, a third PLDataMoverMode alongside benchmark and memory_stream, for feeding the AIE array from an external producer. - Emit a synthesizable HLS traffic_gen kernel that drives the AIE input PLIOs with the same AXI-stream contract the generated array expects - No mm2s kernel is generated in this mode: the input side is sourced by traffic_gen instead of a DDR->PLIO mover. The s2mm output path is unchanged, so results are still read back / golden-checkable. - traffic_gen is a drop-in stand-in: on a real design it is replaced by any IP that honors the same PLIO input contract (sc=<producer>.M_AXIS:PLIO_ifm_*). - PL timing is forced off in this mode

Add external_stream PL data-mover mode

kmhatre14 added 4 commits June 15, 2026 21:17

Feature add

916be3f

- Added PL kernels for copying data fom DRAM to PL to AIE - Added host.cpp for hardware execution - Updated the Makefile to support hw and hw_emu execution

Adding a tutorial to run the hardware flow

acbcb9d

Feature add: PL integration

126a7c0

- Added a knob for PL memory selection (URAM vs BRAM)

PL intergration README

56104d9

dimdano self-requested a review June 17, 2026 16:06

dimdano reviewed Jun 17, 2026

View reviewed changes

kmhatre14 and others added 12 commits June 19, 2026 22:31

PR1: RTP in host.cpp now derived directly from artifacts.

fa0a65c

- Removed the dense layer specific RPT loading

Host RPT update method is used in unrolled form.

4a06e29

Merge pull request #1 from kmhatre14/pr1/rtp_fix

29c9060

PR 1: Fix the RTP artifact sourcing in host generation

Merge pull request #2 from kmhatre14/pr1/data_to_jinja

e04c1e9

PR1: data.h generated using jinja flow.

Addressed inferred attributes

6503894

PR1: Added a configuration for PL timing. Its disabled by default

27a80dd

- Cleaned up the timing measurement prints

Merge pull request #3 from kmhatre14/pr1/optional_pl_timing

ca94cb8

PR1: Added a configuration for PL timing. Its disabled by default

Physical Plan derived attributes (fixed)

7103eee

None additional check (fixed)

f33c9db

Removed Duplicate checks

d839661

Merge pull request #4 from kmhatre14/pr1/io_attributes_fix

4ba1fa0

- io attributes used from build_io_layout and not infered - removed the duplicate checks for Multi graph intput

dimdano added the enhancement New feature or request label Jun 24, 2026

kmhatre14 and others added 24 commits June 24, 2026 08:46

PR 1 Deployment vs Benchmark mode

91020fb

- Added separate HLS top level funciton for send (MM2S) and receive (S2MM) - No preloading in deployment - Double buffering enabled by default - PL timing measurement is made as an option for deployment

Code cleanup

2e4c701

- Removed redundant code. - Separated Makefile generation for AIE vs System

Bug fix

d2eafe2

- The layout dictionary was not used properly resulting in an error during execution for models with multiple layers

Remove the PL - AIE, AIE compute and AIE - PL time from memory_stream…

c04afe8

… mode. Its only applicable in benchmark mode

Merge pull request #5 from kmhatre14/pr1/benchmark_vs_deploy

f76a86d

Merge Pr1/benchmark vs deploy to main

Working on adding hls4ml fields in ONNX frontend

1911ea1

simplify ONNX conversion tool

2ec1ba5

uram formula

ca8cb06

Updated the HLS4ML backend PL related flags for onnx backend

2e8a5a2

Fixed naming conventions and removed attribute from json

72ed0a6

Mirrored BRAM end to end

0923d4b

fixed overestimate issue with stream

2e77c3f

fixed overestimate issue with stream

d8b869c

Merge branch 'pr1/uram_size_formula' of https://github.com/kmhatre14/…

86e106c

…aie4ml into pr1/uram_size_formula

issue with old commit fixed

167b8ad

Merge branch 'pr1/uram_size_formula' from vedant into pr1/uram_size_f…

96473f4

…ormula

Removed stale comments

102a955

Merge pull request #7 from kmhatre14/pr1/uram_size_formula

af897a5

MAX iterations are now derived from input and output size

Bug fix

65846d8

- PL memory calculation is block based (BRAM/URAM block) - Previous calculation was size based and did not consider the block split required for parallel access

Bug fix

981cd45

- PL memory budget of 80% is considered for estimation to avoid overutilization. - Fixed the n_ifm and n_ofm was passed incorrectly to the jinja template

Merge pull request #8 from kmhatre14/pr1/external_stream

158b709

Add external_stream PL data-mover mode

Config options updated in tutorial_3.py

91e22d0

		#include "../src/weights/{{ L.weights_header }}"
		#include "../src/weights/{{ L.bias_header }}"

		// Compile-time cap on n_iter -- sizes the local URAM buffers. This design
		// preloads ALL n_iter iters on-chip; MAX_N_ITER bounds the URAM footprint.

		GCC_FLAGS := -Wall -c -std=c++17 -I$(SDKTARGETSYSROOT)/usr/include/xrt
		GCC_LDFLAGS := -lxrt_coreutil -std=c++17

Conversation

kmhatre14 commented Jun 16, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

dimdano commented Jun 17, 2026

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

kmhatre14 Jun 21, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

kmhatre14 commented Jun 16, 2026 •

edited

Loading

kmhatre14 Jun 21, 2026 •

edited

Loading