remove orphan files #3361
Hi @rambleraptor, thanks for the PR. Quick comment: it may make sense to note in the description that this Closes #1200. I do see a related PR, #1958, but it looks potentially abandoned (CC: @jayceslesar). NOTE: you call it "remove orphan files", while the linked issue is called "delete orphan files". The Java code itself calls it
Please, I need this. I am writing with DuckDB, but it doesn't support any maintenance and I am drowning in old data :)
Rationale for this change
This adds support for the RemoveOrphanFiles metadata maintenance task. The goal is to match the Java implementation.
I had to add a list method to FileIO in order to fully implement this. I can separate that work into a separate PR if that's more useful.
A good follow-up would be to wire this into the CLI. Doing these ad-hoc actions without having to write a script / spin up a Spark cluster is a huge win!
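For readers unfamiliar with the task, the idea of orphan-file detection can be sketched as follows. This is a minimal illustration using only the Python standard library on a local directory, not the code in this PR and not the PyIceberg or Java API: `find_orphan_files`, its parameters, and the use of the local filesystem in place of a `FileIO` list call are all assumptions made for the example. It lists everything under the table location, subtracts the files the table metadata still references, and keeps only files older than a cutoff so that in-progress writes are not flagged.

```python
import time
from pathlib import Path


def find_orphan_files(table_root, referenced, older_than_seconds):
    """Return files under table_root that are unreferenced and old enough.

    Hypothetical helper for illustration only:
    - table_root stands in for the table location that a FileIO list call would scan.
    - referenced stands in for the set of paths reachable from table metadata.
    - older_than_seconds is a safety margin that protects files still being written.
    """
    cutoff = time.time() - older_than_seconds
    referenced_paths = {Path(p).resolve() for p in referenced}

    orphans = []
    for path in Path(table_root).rglob("*"):
        if not path.is_file():
            continue  # skip directories
        resolved = path.resolve()
        if resolved in referenced_paths:
            continue  # still referenced by the table, so not an orphan
        if path.stat().st_mtime < cutoff:
            orphans.append(resolved)  # unreferenced and older than the cutoff
    return sorted(orphans)
```

The function only reports candidates; actually deleting them would be a separate step, which keeps a dry run cheap and safe.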
Are these changes tested?
I did some local testing: I took a table with orphaned files and ran both the Java and PyIceberg implementations against it. The results were the same.
There are also plenty of tests.
Are there any user-facing changes?