Conversation
|
Thanks for doing this @williamsnell! |
|
this is great @williamsnell! I'm wondering if this be exposed as part of the general array configuration, which is where |
|
@d-v-b I've pushed a new commit moving this to One note: I'm not sure how this interacts with Sharding now - following the existing code I hardcoded |
I don't think we want this new configuration option to change the behavior of the sharding codec. A missing subchunk inside a shard is conveyed explicitly via the shard index, so from the sharding codec's POV you can't have a subchunk appear missing due to a network error. |
If I've understood correctly, we'll want to make this tweak to def _get_chunk_spec(self, shard_spec: ArraySpec) -> ArraySpec:
+ # Because the shard index and inner chunks should be stored
+ # together, we detect missing data via the shard index.
+ # The inner chunks defined here are thus allowed to return
+ # None, even if fill_missing_chunks=False at the array level.
+ config = replace(shard_spec.config, fill_missing_chunks=True)
return ArraySpec(
shape=self.chunk_shape,
dtype=shard_spec.dtype,
fill_value=shard_spec.fill_value,
- config=shard_spec.config,
+ config=config,
prototype=shard_spec.prototype,
)With this change, I think my previous point was wrong - we would be able to use |
example in zarr-python zarr-developers#486.
self.supports_partial_decode`.
expected behaviour of fill_missing_chunks for both sharding and write_empty_chunks via tests. Use elif to make control flow slightly clearer.
6db55a1 to
de7afd8
Compare
|
I've committed the change to I've also made two more changes:
|
|
Is there a way to run |
|
a few suggestions about the error class, if it's OK with you @williamsnell I'd be happy pushing those changes to this branch? |
Go for it @d-v-b - thank you! |
yeah |
|
From the discussion above, I'm planning to do the following:
Are we all happy with this approach? Anything I've missed? |
This seems like a good summary! I think max's approach doesn't require breaking any public APIs, but it does change the return type of |
for example, Clients interested in the indices of the missing values can find them, but this is also open to adding additional information in the future by widening the type of |
Speaking as the obviously least qualified zarr contributor in this thread, this sounds good to me 😄. Before I make this change: double-checking the exact type we want to implement here? e.g. Otherwise, I'll merge maxrjones@37a40e3 into this branch, then modify it to return |
|
TBH I don't see value in widening the return type. While marginal compared to I/O, return a "present" status for every chunk adds memory and processing overhead. I think we should only do that if there's practical value from the information, rather than speculatively suggesting that the information could be needed later. |
|
I could see value in extensibility in the return type for missing chunk information. Something like: @dataclass(frozen=True)
class MissingChunkInfo:
key: str
coordinates: tuple[int, ...]With |
By returning a list of ints, the current design is not extensible without breaking downstream consumers. We can't predict how this might be used, so we should aim for a return type that can be extended. I don't think the literal strings are terribly extensible, but a typeddict would be: class ReadResult(TypedDict):
missing: Literal[True, False]This is cheap to construct and gives us as much flexibility as we could possibly need, provided we are OK sticking with a python primitive. |
Compared to IO and compression, the overhead of bookkeeping our reads is negligible. That probably shouldn't weigh too heavily on our design here. |
|
Color me skeptical of returning info about all chunks rather than just missing ones, but I'd be glad to be proven wrong down the line. I support the solution from #3748 (comment) and won't block a dense tuple or set return type. |
|
sorry for the distraction from the core of your PR @williamsnell, we are definitely going to get this merged one way or another. It's just that we are running into a few warts in our basic IO model, which is that so far we have not associated IO operations with any kind of annotations / metadata besides the literal payload when fetching data. This means useful information, like the time required for the read / write, and any other information about the read / write operation the storage backend might know, is totally missing from our stack. Ideally we would have defined these semantics at the |
|
💯 thank you for your patience and working on this @williamsnell 🙇 🙌 |
|
I'm working on merging the changes from upstream/main and @maxrjones's branch. Question: The only name I've found that feels sufficiently explanatory and keeps the Any strong preferences for any of the options ( |
…37a40e3. Update docstrings to match current behaviour. Move description of sharding behaviour to test, now that it has no dedicated codepath.
yeah, I get that. what about specifically out of the options you suggested, I find |
|
I've pushed the combined changes to this branch - thanks all! Note: codec_pipeline.py is now identical to what's on main:
Only in array.py do we actually check the config for |
With the refactors I've just pushed, the most accurate description is definitely |
I think both of these names work, I will submit a third: |
|
Ok, I've gone in circles myself enough with the naming and think all options have their merits. For the sake of keeping momentum, I'll leave the name as is ( With that, I think this PR is ready - but if there's anything else needed, please let me know! |
I think this is fine TBH. Just wanted to reiterate why I mentioned this in the first place (since it hasn't been mentioned): having Very pedantic, and I appear to be in the minority. |
|
i think as long as the name is not catastrophically bad we are fine :) we can always introduce a new config field in the future if we are for some reason deeply dissatisfied with the one here. |
Add config options for whether a missing chunk should:
fill_value(current behaviour; retained as default)MissingChunkErrorThis PR is entirely based on the work of @tomwhite in this issue. I've started this PR as this an important feature that I'd like to see merged.
I've added a test (based on the demo in the issue) and a minor docs tweak.
Questions:
config.md- is this the right place?TODO:
docs/user-guide/*.mdchanges/