Add PL + host code generation for on-board hardware flow#4
Conversation
- Added PL kernels for copying data fom DRAM to PL to AIE - Added host.cpp for hardware execution - Updated the Makefile to support hw and hw_emu execution
- Added a knob for PL memory selection (URAM vs BRAM)
|
Thank you for the work. The PL integration is a useful direction and follows the same broad pattern as other AMD projects: AIE graph + PL data movers + system.cfg connectivity + XRT host. Before merging, I think this should be tightened into a strict system-emission layer over the existing aie4ml physical plan. Right now some logic is specialized to the current dense/int8 benchmark and bypasses contracts that already exist in IOPortLayout, artifact emission, and the physical plan. The main fixes I would like are generic RTP artifact loading, dtype-aware boundary packing, deriving all graph IO from build_io_layout(), optional correctness checking and probably making the PL timer/preload behavior an explicit/configurable mode rather than the only system backend behavior. |
| artifacts = inst.variant.get_artifacts(inst) | ||
| weights = next((a for a in artifacts if a.get('name') == 'weights'), None) | ||
| if weights is None: | ||
| continue # param-less op (no kernel-resident RTP weights) | ||
| bias = next((a for a in artifacts if a.get('name') == 'bias'), None) |
There was a problem hiding this comment.
maybe need to get any RTP artifact returned by variant.get_artifacts(), using the same generic contract as app.cpp.jinja. Dense happens to use weights/bias, but LayerNorm uses gamma/beta and future ops may expose LUTs or other parameter tables.
| #include "../src/weights/{{ L.weights_header }}" | ||
| #include "../src/weights/{{ L.bias_header }}" |
There was a problem hiding this comment.
any future artifacts must be handled, not only weights/bias
| 'in_feat': in_feat, | ||
| 'out_feat': out_feat, | ||
| 'in_feat_slice': in_feat // n_ifm, | ||
| 'out_feat_slice': out_feat // cas_num, | ||
| 'cas_num': cas_num, |
There was a problem hiding this comment.
in this file graph boundary sizes, offsets, dtypes, and PLIO port counts should come from build_io_layout()/IOPortLayout and the physical plan. not inferred
| tensors = list(ports_map) | ||
| if len(tensors) != 1: | ||
| raise NotImplementedError( | ||
| f'system I/O plan supports a single graph {direction} tensor; ' | ||
| f'got {len(tensors)} ({tensors}). Multiple graph {direction}s are not yet supported.' | ||
| ) |
There was a problem hiding this comment.
Multiple PLIO ports for one tensor are required and supported here. but we might have multiple logical graph inputs/outputs which is unsupported. That is fine for an initial version, but this limitation should be documented clearly. The problem is that the failure is scattered and not expressed as a single explicit hardware boundary contract (i.e., remove duplicated checks from _single_io_feat() and pack_host_data(), etc.).
There was a problem hiding this comment.
Removed the duplicate checks and moved it to the start of build_system_io()
Addressed in 4ba1fa0
| """Round-robin pack per-stream int8 tiles into the data mover's 512-bit DDR layout. | ||
|
|
||
| Returns a little-endian uint32 array (the host buffer element type). Mirrors the data | ||
| mover: contiguous DDR word i is owned by stream i % n_streams. | ||
| """ | ||
| word_blocks = [] | ||
| for tile in port_tiles: | ||
| flat = np.ascontiguousarray(tile).astype(np.int8).reshape(-1) |
There was a problem hiding this comment.
This silently narrows all graph-boundary data to int8. The PL data mover can remain 128-bit and type-agnostic; probably better to use raw storage bytes from IOPortLayout.dtype and validate alignment. If only int8 is supported in this PR it is fine, just fail hard.
There was a problem hiding this comment.
Removed the int8 support while making it support raw storage bytes e04c1e9
| in_tensor = next(iter(layout.inputs)) | ||
| in_ports = layout.inputs[in_tensor] | ||
| out_ports = layout.outputs[next(iter(layout.outputs))] | ||
| batch, in_feat = in_ports[0].numpy_boundary_shape |
There was a problem hiding this comment.
here also avoid assuming we have 2d inputs. maybe numpy_boundary_shape is useful here.
| body = ', '.join(str(int(v) & 0xFFFFFFFF) for v in values) | ||
| return f'static const uint32_t {name}[{size_macro}] = {{ {body} }};\n' | ||
|
|
||
| # Direct method to write host/data.h TODO: Should be moved to jinja template |
There was a problem hiding this comment.
Agree with this TODO. For consistency with the rest of the writer, data.h should be emitted via a small jinja template, while Python only prepares the packed values.
| // Compile-time cap on n_iter -- sizes the local URAM buffers. This design | ||
| // preloads ALL n_iter iters on-chip; MAX_N_ITER bounds the URAM footprint. |
There was a problem hiding this comment.
useful for benchmarking but we need something more scalable as the default deployment data mover. For deployment-style flows, especially where input may come from upstream PL/optical links can we use a streaming or double-buffered MM2S/S2MM-style mover?
| hls::stream<bool, 2> trig_compute_done; | ||
| hls::stream<bool, 2> stop_sig; | ||
|
|
||
| tick_gen(trig_preload_done, trig_first_send, trig_last_send, |
There was a problem hiding this comment.
maybe make it as an optional instrumentation? A config like EnablePLTiming for example
| GCC_FLAGS := -Wall -c -std=c++17 -I$(SDKTARGETSYSROOT)/usr/include/xrt | ||
| GCC_LDFLAGS := -lxrt_coreutil -std=c++17 |
There was a problem hiding this comment.
do we need to pass somewhere --sysroot=${SDKTARGETSYSROOT} ?
There was a problem hiding this comment.
Not if you source
$XILINX_VERSAL/environment-setup-cortexa72-cortexa53-amd-linux
This is a part of the common image files from AMD.
You can set it if you want to point to some different $SDKTARGETSYSROOT
There was a problem hiding this comment.
also ONNX frontend needs a clean way to set these fields as well, maybe through AIEConfig
There was a problem hiding this comment.
Fixed in ef49e3b
The extra config variables are
`
'Target': 'hardware',
'PLMemory': 'uram',
'EnablePLTiming': False,
'PLDataMoverMode': 'benchmark',
`
- Removed the dense layer specific RPT loading
PR 1: Fix the RTP artifact sourcing in host generation
- data-type agnostic - Uses the full n-D from numpy_boundry_shape - The current implementation just strores random values as input and does not have a way to pass user based inputs. As furture work this has to be changes by providing input as a argument for host.exe
PR1: data.h generated using jinja flow.
- Cleaned up the timing measurement prints
PR1: Added a configuration for PL timing. Its disabled by default
- io attributes used from build_io_layout and not infered - removed the duplicate checks for Multi graph intput
- Added separate HLS top level funciton for send (MM2S) and receive (S2MM) - No preloading in deployment - Double buffering enabled by default - PL timing measurement is made as an option for deployment
- Removed redundant code. - Separated Makefile generation for AIE vs System
… mode. Its only applicable in benchmark mode
Merge Pr1/benchmark vs deploy to main
Adding PL related knobs to ONNX backend
Below are the new knobs added to the config generated for ONNX using build_bit_exact_In_add_matmul_model.py
def build_config() -> dict:
return {
'Part': PART,
'AIEConfig': {
'BatchSize': BATCH,
'Iterations': ITERATIONS,
'Target': TARGET,
'PLMemory': PL_MEMORY,
'EnablePLTiming': ENABLE_PL_TIMING,
'PLDataMoverMode': PL_DATA_MOVER_MODE,
},
'LayerDirectives': LAYER_DIRECTIVES,
}
…aie4ml into pr1/uram_size_formula
MAX iterations are now derived from input and output size
Introduce external_stream, a third PLDataMoverMode alongside benchmark and memory_stream, for feeding the AIE array from an external producer. - Emit a synthesizable HLS traffic_gen kernel that drives the AIE input PLIOs with the same AXI-stream contract the generated array expects - No mm2s kernel is generated in this mode: the input side is sourced by traffic_gen instead of a DDR->PLIO mover. The s2mm output path is unchanged, so results are still read back / golden-checkable. - traffic_gen is a drop-in stand-in: on a real design it is replaced by any IP that honors the same PLIO input contract (sc=<producer>.M_AXIS:PLIO_ifm_*). - PL timing is forced off in this mode
Add external_stream PL data-mover mode
Adds an opt-in hardware emission layer to the AIE backend, enabling end-to-end execution on the VEK280 board (and hw emulation) alongside the existing aiesim flow.
Includes a hardware-flow tutorial tutorial_3.py in tutorials
Details about the change and execution are in docs/README_PL_integration.md.