Conversation
connortsui20
left a comment
There was a problem hiding this comment.
Is it possible to follow the template a bit more closely? It's less about following the exact template and more that it's a bit hard to understand what it being proposed and the different tradeoffs (alternatives) without some of the sections in the template. Basically, when reading through this I'm trying to figure out the "why" via the "what", and it should be the other way around.
Additionally, there is quite a lot of code here that is paired with high-level description. I think it would be helpful if some of the boilerplate-y code was removed so we could focus on things that are not obvious. That would help in making this easier to read as well.
Also as an aside: I'm not an FFI expert but I feel like there is a lack of ownership semantics described here. Shouldn't that be one of the main things we have to worry about with interop between Rust and C?
I've thought about this, and likely we don't want to expose (in-memory) caching as a separate implementation. If host provides us with malloc/free-compatible functions, we can build our cache inside Vortex and host will still have observability. On persistent caching, this may possibly be useful for remote files, but in this case it should be host's responsibility to save the decoded or original file somewhere on the filesystem and provide us with a reference. This can already be done with file_open (host may get the reference to a cached file). High level API is missing a way to read a file directly from a buffer, but I didn't add it as I'm not sure this would be used by anyone. |
connortsui20
left a comment
There was a problem hiding this comment.
Looks good! I had a few comments related to some missing detail and structure (and there are some grammar errors that I just left out) but this looks like it is on the right track.
I think it is good we are splitting this out into multiple proposals, but do we have a timeline for when we want the other layers? My guess is as we go down it gets even harder than it already is so it might be worth designing those sooner than later.
| Vortex manages the event loop by using a CurrentThreadRuntime runtime handle | ||
| which in turn has a `smol::Executor<'static>` without a background thread pool. | ||
| This means the event loop will run on host's thread pool scheduler (in other | ||
| terms, use the calling thread) implicitly. In case of multiple threads, the | ||
| underlying executor is shared between them, and drives from from all of these | ||
| threads. The executor is initialized once on loading FFI library. |
There was a problem hiding this comment.
I think we might want more detail on this (unless this is outside of the scope of this RFC and deserves its own document?).
For example, what does "the underlying executor is shared between them, and drives from from all of these threads" actually mean? So would the host thread hold the executor and then spawn more threads as necessary, or is this actually just a single host thread registers(?) the executor and the other host threads somehow know about this execution thread? Or are all of the threads cooperatively synchronizing the driving of execution where thread A calling vx_partition_next can make progress on tasks spawned by thread B while blocked on IO (like how tokio moves threads around)?
Or does this just not matter at all for the "high-level" layer? In that case maybe this should be moved out of this RFC.
| implemented separately. As this is a big change, this PR will from now on focus | ||
| just on the high level scan API. Other levels may be implemented later on |
There was a problem hiding this comment.
Can you move this to the top so that readers know this is only talking about the high-level layer?
| Current implementation copy-pastes ArrowSchema and ArrowArrayStream and gates | ||
| them behind a macro. | ||
|
|
||
| ## High level API integration example: DuckDB |
There was a problem hiding this comment.
Now that this RFC just focuses on the top layer, could you move this to the beginning of the RFC? Also, there doesn't actually seem to be that much different before and after, is the only difference rust vtable vs c ffi? And what is the benefit of (what I am assuming is the case right now) dropping the cxx bindings here? Is it just simplicity and maintenance surface area?
| // special case: row_range_begin == 0 && row_range_end == 0 means "all rows" | ||
| uint64_t row_range_begin; // Inclusive | ||
| uint64_t row_range_end; // Exclusive |
There was a problem hiding this comment.
This seems like a footgun... What about schema-only queries? I think it would be safer to have a flag for all_rows or something
|
|
||
| | Level | Allocations | IO | Event loop | Scan planning | | ||
| | ----- | ----------- | -- | ---------- | ------------- | | ||
| | High | Host/Vortex | Host/Vortex | Vortex | Vortex | |
There was a problem hiding this comment.
AFAICT there isn't any description of what host vs vortex allocations and IO looks like in this RFC? A brief description somewhere would be nice. For example, what is the API in which the host choose to make allocations themselves vs Vortex? And I guess if the fs callback function pointers are null we know that Vortex has to manage IO?
Edit: I just saw the comment below where it says that custom memory allocators are outside the scope of this RFC, could you move that up here too?
High level API draft: vortex-data/vortex#7212
Duckdb integration PoC: https://github.com/vortex-data/vortex/tree/myrrc/scan-api-duckdb