Skip to content

Add PL + host code generation for on-board hardware flow#4

Open
kmhatre14 wants to merge 40 commits into
dimdano:mainfrom
kmhatre14:main
Open

Add PL + host code generation for on-board hardware flow#4
kmhatre14 wants to merge 40 commits into
dimdano:mainfrom
kmhatre14:main

Conversation

@kmhatre14

@kmhatre14 kmhatre14 commented Jun 16, 2026

Copy link
Copy Markdown

Adds an opt-in hardware emission layer to the AIE backend, enabling end-to-end execution on the VEK280 board (and hw emulation) alongside the existing aiesim flow.

Includes a hardware-flow tutorial tutorial_3.py in tutorials
Details about the change and execution are in docs/README_PL_integration.md.

- Added PL kernels for copying data fom DRAM to PL to AIE
- Added host.cpp for hardware execution
- Updated the Makefile to support hw and hw_emu execution
- Added a knob for PL memory selection (URAM vs BRAM)
@dimdano dimdano self-requested a review June 17, 2026 16:06
@dimdano

dimdano commented Jun 17, 2026

Copy link
Copy Markdown
Owner

Thank you for the work. The PL integration is a useful direction and follows the same broad pattern as other AMD projects: AIE graph + PL data movers + system.cfg connectivity + XRT host. Before merging, I think this should be tightened into a strict system-emission layer over the existing aie4ml physical plan. Right now some logic is specialized to the current dense/int8 benchmark and bypasses contracts that already exist in IOPortLayout, artifact emission, and the physical plan. The main fixes I would like are generic RTP artifact loading, dtype-aware boundary packing, deriving all graph IO from build_io_layout(), optional correctness checking and probably making the PL timer/preload behavior an explicit/configurable mode rather than the only system backend behavior.

Comment thread src/aie4ml/system_plan.py Outdated
Comment on lines +103 to +107
artifacts = inst.variant.get_artifacts(inst)
weights = next((a for a in artifacts if a.get('name') == 'weights'), None)
if weights is None:
continue # param-less op (no kernel-resident RTP weights)
bias = next((a for a in artifacts if a.get('name') == 'bias'), None)

Copy link
Copy Markdown
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe need to get any RTP artifact returned by variant.get_artifacts(), using the same generic contract as app.cpp.jinja. Dense happens to use weights/bias, but LayerNorm uses gamma/beta and future ops may expose LUTs or other parameter tables.

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Addressed this in the commit: 29c9060

Comment on lines +29 to +30
#include "../src/weights/{{ L.weights_header }}"
#include "../src/weights/{{ L.bias_header }}"

Copy link
Copy Markdown
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

any future artifacts must be handled, not only weights/bias

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Addressed this in the commit: 29c9060

Comment thread src/aie4ml/system_plan.py
Comment on lines +162 to +166
'in_feat': in_feat,
'out_feat': out_feat,
'in_feat_slice': in_feat // n_ifm,
'out_feat_slice': out_feat // cas_num,
'cas_num': cas_num,

Copy link
Copy Markdown
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

in this file graph boundary sizes, offsets, dtypes, and PLIO port counts should come from build_io_layout()/IOPortLayout and the physical plan. not inferred

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Addressed in 4ba1fa0

Comment thread src/aie4ml/system_plan.py Outdated
Comment on lines +47 to +52
tensors = list(ports_map)
if len(tensors) != 1:
raise NotImplementedError(
f'system I/O plan supports a single graph {direction} tensor; '
f'got {len(tensors)} ({tensors}). Multiple graph {direction}s are not yet supported.'
)

Copy link
Copy Markdown
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Multiple PLIO ports for one tensor are required and supported here. but we might have multiple logical graph inputs/outputs which is unsupported. That is fine for an initial version, but this limitation should be documented clearly. The problem is that the failure is scattered and not expressed as a single explicit hardware boundary contract (i.e., remove duplicated checks from _single_io_feat() and pack_host_data(), etc.).

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removed the duplicate checks and moved it to the start of build_system_io()
Addressed in 4ba1fa0

Comment thread src/aie4ml/system_plan.py Outdated
Comment on lines +184 to +191
"""Round-robin pack per-stream int8 tiles into the data mover's 512-bit DDR layout.

Returns a little-endian uint32 array (the host buffer element type). Mirrors the data
mover: contiguous DDR word i is owned by stream i % n_streams.
"""
word_blocks = []
for tile in port_tiles:
flat = np.ascontiguousarray(tile).astype(np.int8).reshape(-1)

Copy link
Copy Markdown
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This silently narrows all graph-boundary data to int8. The PL data mover can remain 128-bit and type-agnostic; probably better to use raw storage bytes from IOPortLayout.dtype and validate alignment. If only int8 is supported in this PR it is fine, just fail hard.

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removed the int8 support while making it support raw storage bytes e04c1e9

Comment thread src/aie4ml/system_plan.py Outdated
in_tensor = next(iter(layout.inputs))
in_ports = layout.inputs[in_tensor]
out_ports = layout.outputs[next(iter(layout.outputs))]
batch, in_feat = in_ports[0].numpy_boundary_shape

Copy link
Copy Markdown
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

here also avoid assuming we have 2d inputs. maybe numpy_boundary_shape is useful here.

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Used numpy_boundary_shape : e04c1e9

Comment thread src/aie4ml/system_plan.py Outdated
body = ', '.join(str(int(v) & 0xFFFFFFFF) for v in values)
return f'static const uint32_t {name}[{size_macro}] = {{ {body} }};\n'

# Direct method to write host/data.h TODO: Should be moved to jinja template

Copy link
Copy Markdown
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agree with this TODO. For consistency with the rest of the writer, data.h should be emitted via a small jinja template, while Python only prepares the packed values.

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

data.h now emitted from a Jinja template e04c1e9

Comment on lines +42 to +43
// Compile-time cap on n_iter -- sizes the local URAM buffers. This design
// preloads ALL n_iter iters on-chip; MAX_N_ITER bounds the URAM footprint.

Copy link
Copy Markdown
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

useful for benchmarking but we need something more scalable as the default deployment data mover. For deployment-style flows, especially where input may come from upstream PL/optical links can we use a streaming or double-buffered MM2S/S2MM-style mover?

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This part has been changed drastically

  • Added MM2S and S2MM kernels in the memory_stream deployment mode f76a86d
  • MAX iter for benchmark mode is derived from input and output size : af897a5

hls::stream<bool, 2> trig_compute_done;
hls::stream<bool, 2> stop_sig;

tick_gen(trig_preload_done, trig_first_send, trig_last_send,

Copy link
Copy Markdown
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe make it as an optional instrumentation? A config like EnablePLTiming for example

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added this as an optional instrument: ca94cb8

Comment on lines +73 to +74
GCC_FLAGS := -Wall -c -std=c++17 -I$(SDKTARGETSYSROOT)/usr/include/xrt
GCC_LDFLAGS := -lxrt_coreutil -std=c++17

Copy link
Copy Markdown
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do we need to pass somewhere --sysroot=${SDKTARGETSYSROOT} ?

@kmhatre14 kmhatre14 Jun 21, 2026

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not if you source

$XILINX_VERSAL/environment-setup-cortexa72-cortexa53-amd-linux

This is a part of the common image files from AMD.
You can set it if you want to point to some different $SDKTARGETSYSROOT

Copy link
Copy Markdown
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

also ONNX frontend needs a clean way to set these fields as well, maybe through AIEConfig

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in ef49e3b

The extra config variables are

`
'Target': 'hardware',

'PLMemory': 'uram',

'EnablePLTiming': False,

'PLDataMoverMode': 'benchmark',
`

kmhatre14 and others added 12 commits June 19, 2026 22:31
- Removed the dense layer specific RPT loading
PR 1: Fix the RTP artifact sourcing in host generation
- data-type agnostic
- Uses the full n-D from numpy_boundry_shape
- The current implementation just strores random values as input and does not have a way to pass user based inputs. As furture work this has to be changes by providing input as a argument for host.exe
PR1: data.h generated using jinja flow.
- Cleaned up the timing measurement prints
PR1: Added a configuration for PL timing. Its disabled by default
- io attributes used from build_io_layout and not infered
- removed the duplicate checks for Multi graph intput
@dimdano dimdano added the enhancement New feature or request label Jun 24, 2026
kmhatre14 and others added 24 commits June 24, 2026 08:46
- Added separate HLS top level funciton for send (MM2S) and receive (S2MM)
- No preloading in deployment
- Double buffering enabled by default
- PL timing measurement is made as an option for deployment
- Removed redundant code.
- Separated Makefile generation for AIE vs System
- The layout dictionary was not used properly resulting in an error during execution for models with multiple layers
… mode.

Its only applicable in benchmark mode
Merge Pr1/benchmark vs deploy to main
Adding PL related knobs to ONNX backend

Below are the new knobs added to the config generated for ONNX using build_bit_exact_In_add_matmul_model.py

def build_config() -> dict:
    return {
        'Part': PART,
        'AIEConfig': {
            'BatchSize': BATCH,
            'Iterations': ITERATIONS,
            'Target': TARGET,
            'PLMemory': PL_MEMORY,
            'EnablePLTiming': ENABLE_PL_TIMING,
            'PLDataMoverMode': PL_DATA_MOVER_MODE,
        },
        'LayerDirectives': LAYER_DIRECTIVES,
    }
MAX iterations are now derived from input and output size
- PL memory calculation is block based (BRAM/URAM block)
- Previous calculation was size based and did not consider the block split required for parallel access
- PL memory budget of 80% is considered for estimation to avoid overutilization.
- Fixed the n_ifm and n_ofm was passed incorrectly to the jinja template
Introduce external_stream, a third PLDataMoverMode alongside benchmark and
memory_stream, for feeding the AIE array from an external producer.

- Emit a synthesizable HLS traffic_gen kernel that drives the AIE input PLIOs
  with the same AXI-stream contract the generated array expects
- No mm2s kernel is generated in this mode: the input side is sourced by
  traffic_gen instead of a DDR->PLIO mover. The s2mm output path is unchanged,
  so results are still read back / golden-checkable.
- traffic_gen is a drop-in stand-in: on a real design it is replaced by any IP
  that honors the same PLIO input contract (sc=<producer>.M_AXIS:PLIO_ifm_*).
- PL timing is forced off in this mode
Add external_stream PL data-mover mode
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

enhancement New feature or request

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants