Skip to content

High level C Scan API#32

Open
myrrc wants to merge 4 commits intodevelopfrom
myrrc/scan-api-c
Open

High level C Scan API#32
myrrc wants to merge 4 commits intodevelopfrom
myrrc/scan-api-c

Conversation

@myrrc
Copy link
Copy Markdown

@myrrc myrrc commented Mar 20, 2026

Copy link
Copy Markdown
Contributor

@connortsui20 connortsui20 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it possible to follow the template a bit more closely? It's less about following the exact template and more that it's a bit hard to understand what it being proposed and the different tradeoffs (alternatives) without some of the sections in the template. Basically, when reading through this I'm trying to figure out the "why" via the "what", and it should be the other way around.

Additionally, there is quite a lot of code here that is paired with high-level description. I think it would be helpful if some of the boilerplate-y code was removed so we could focus on things that are not obvious. That would help in making this easier to read as well.

Also as an aside: I'm not an FFI expert but I feel like there is a lack of ownership semantics described here. Shouldn't that be one of the main things we have to worry about with interop between Rust and C?

@myrrc
Copy link
Copy Markdown
Author

myrrc commented Mar 31, 2026

Is this the way we want to expose a cache?

I've thought about this, and likely we don't want to expose (in-memory) caching as a separate implementation. If host provides us with malloc/free-compatible functions, we can build our cache inside Vortex and host will still have observability.

On persistent caching, this may possibly be useful for remote files, but in this case it should be host's responsibility to save the decoded or original file somewhere on the filesystem and provide us with a reference. This can already be done with file_open (host may get the reference to a cached file).

High level API is missing a way to read a file directly from a buffer, but I didn't add it as I'm not sure this would be used by anyone.

@myrrc myrrc force-pushed the myrrc/scan-api-c branch from 0796aca to ae6ccfd Compare March 31, 2026 11:26
@myrrc myrrc changed the title C Scan API High level C Scan API Apr 1, 2026
@myrrc myrrc force-pushed the myrrc/scan-api-c branch from 1637b3b to 23c6fda Compare April 1, 2026 16:35
Copy link
Copy Markdown
Contributor

@connortsui20 connortsui20 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good! I had a few comments related to some missing detail and structure (and there are some grammar errors that I just left out) but this looks like it is on the right track.

I think it is good we are splitting this out into multiple proposals, but do we have a timeline for when we want the other layers? My guess is as we go down it gets even harder than it already is so it might be worth designing those sooner than later.

Comment on lines +98 to +103
Vortex manages the event loop by using a CurrentThreadRuntime runtime handle
which in turn has a `smol::Executor<'static>` without a background thread pool.
This means the event loop will run on host's thread pool scheduler (in other
terms, use the calling thread) implicitly. In case of multiple threads, the
underlying executor is shared between them, and drives from from all of these
threads. The executor is initialized once on loading FFI library.
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we might want more detail on this (unless this is outside of the scope of this RFC and deserves its own document?).

For example, what does "the underlying executor is shared between them, and drives from from all of these threads" actually mean? So would the host thread hold the executor and then spawn more threads as necessary, or is this actually just a single host thread registers(?) the executor and the other host threads somehow know about this execution thread? Or are all of the threads cooperatively synchronizing the driving of execution where thread A calling vx_partition_next can make progress on tasks spawned by thread B while blocked on IO (like how tokio moves threads around)?

Or does this just not matter at all for the "high-level" layer? In that case maybe this should be moved out of this RFC.

Comment on lines +73 to +74
implemented separately. As this is a big change, this PR will from now on focus
just on the high level scan API. Other levels may be implemented later on
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you move this to the top so that readers know this is only talking about the high-level layer?

Current implementation copy-pastes ArrowSchema and ArrowArrayStream and gates
them behind a macro.

## High level API integration example: DuckDB
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Now that this RFC just focuses on the top layer, could you move this to the beginning of the RFC? Also, there doesn't actually seem to be that much different before and after, is the only difference rust vtable vs c ffi? And what is the benefit of (what I am assuming is the case right now) dropping the cxx bindings here? Is it just simplicity and maintenance surface area?

Comment on lines +254 to +256
// special case: row_range_begin == 0 && row_range_end == 0 means "all rows"
uint64_t row_range_begin; // Inclusive
uint64_t row_range_end; // Exclusive
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems like a footgun... What about schema-only queries? I think it would be safer to have a flag for all_rows or something


| Level | Allocations | IO | Event loop | Scan planning |
| ----- | ----------- | -- | ---------- | ------------- |
| High | Host/Vortex | Host/Vortex | Vortex | Vortex |
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

AFAICT there isn't any description of what host vs vortex allocations and IO looks like in this RFC? A brief description somewhere would be nice. For example, what is the API in which the host choose to make allocations themselves vs Vortex? And I guess if the fs callback function pointers are null we know that Vortex has to manage IO?

Edit: I just saw the comment below where it says that custom memory allocators are outside the scope of this RFC, could you move that up here too?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants