Implement Apache Arrow Interface for PODIO #967

@arnavdham

Overview

This is the parent tracking issue for the Google Summer of Code (GSoC) 2026 project to design and implement an Apache Arrow backend for PODIO.

The goal is to translate YAML-defined Event Data Models (EDMs) into Arrow's columnar in-memory format, enabling language-independent, zero-copy data access and straightforward serialization to widely used formats such as Parquet. This is intended to support high-speed streaming readout and reconstruction frameworks such as EICrecon in future experiments.
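
To make the intended translation concrete, here is a minimal sketch in plain Python (no Arrow or Jinja2 dependency) of turning PODIO-style member declarations into (name, Arrow type) pairs and transposing per-object records into columns. The `CPP_TO_ARROW` table and the helper names `parse_member`, `arrow_fields`, and `to_columns` are hypothetical and only illustrate the idea; the actual mapping conventions are still to be defined in Task 1.

```python
# Hypothetical C++ -> Arrow type-name table; the real convention is the
# subject of Task 1 and may differ.
CPP_TO_ARROW = {
    "float": "float32",
    "double": "float64",
    "int": "int32",  # assumes a 32-bit int on the target platform
    "std::int32_t": "int32",
    "std::uint64_t": "uint64",
}


def parse_member(decl):
    """Split a declaration like 'double energy // comment' into (cpp_type, name)."""
    code = decl.split("//", 1)[0].strip()
    cpp_type, name = code.rsplit(None, 1)
    return cpp_type, name


def arrow_fields(members):
    """Return [(name, arrow_type_name)] for a list of member declarations."""
    fields = []
    for decl in members:
        cpp_type, name = parse_member(decl)
        fields.append((name, CPP_TO_ARROW[cpp_type]))
    return fields


def to_columns(rows, fields):
    """Transpose row-wise records (one dict per object) into one list per field."""
    return {name: [row[name] for row in rows] for name, _ in fields}
```

For example, `arrow_fields(["std::uint64_t cellID // id", "double energy // GeV"])` yields one `uint64` field and one `float64` field, and `to_columns` then produces one contiguous list per field. Types outside the table (fixed-size arrays, components) raise `KeyError` in this sketch and would need their own mapping rules.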


Roadmap & Deliverables

  • Task 1: Core Schema Mapping (Jinja2 Generators) (Sub-issue: YAML to Arrow Schema Mapping #968)
    • Define mapping conventions from PODIO YAML to Arrow structures.
    • Implement Jinja2 templates for ArrowMapper.h.
  • Task 2: ArrowWriter Implementation (Conversion, Registry & Relations)
    • Implement dynamic category schema assembly using a global type registry (ArrowTypeRegistry).
    • Implement the ArrowWriter in-memory conversion loop using arrow::Buffer::Wrap for flat wrapping.
    • Map non-contiguous vector members using arrow::ListView (zero-copy) and arrow::ListArray (normalization).
    • Map 1-to-1, 1-to-many, and subset relations to reference structures.
  • Task 3: The Arrow Reader
    • Build ArrowReader to deserialize Arrow Tables back to PODIO Frames.
  • Task 4: Round-Trip Validation
    • Implement full C++ validation tests comparing Frames with those serialized/deserialized via Arrow.
  • Task 5: Parquet & Dataset Persistence
    • Integrate arrow::dataset and the Parquet C++ writer to persist tables as self-describing Parquet files.
  • Task 6: Performance Benchmarking
    • Benchmark I/O throughput (MB/s), file sizes, and memory usage compared to ROOT TTree/RNTuple and SIO.
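
As an illustration of the two vector-member layouts mentioned in Task 2, the following plain-Python sketch shows the normalization step from a ListView-style layout (one offset and one size per element, which may point into the values buffer in any order) to a List-style layout (a single monotonically increasing offsets array of length n+1 over compacted values). The function name `listview_to_list` is hypothetical, and Python lists stand in for Arrow buffers; a real implementation would operate on Arrow arrays in C++.

```python
def listview_to_list(values, offsets, sizes):
    """Normalize ListView-style (offsets, sizes) into List-style offsets.

    Element i of the input covers values[offsets[i] : offsets[i] + sizes[i]].
    The output stores the elements back to back, so element i covers
    out_values[out_offsets[i] : out_offsets[i + 1]].
    """
    out_values = []
    out_offsets = [0]
    for off, size in zip(offsets, sizes):
        out_values.extend(values[off:off + size])  # copy: not zero-copy
        out_offsets.append(len(out_values))
    return out_values, out_offsets
```

The ListView form can reference existing storage without moving data, which is what makes it a candidate for the zero-copy path. The List form requires a copy whenever the source ranges are out of order or non-contiguous, but it is the more widely supported layout.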
