Skip to content

Fill missing chunks#3748

Open
williamsnell wants to merge 29 commits intozarr-developers:mainfrom
williamsnell:fill-missing-chunks
Open

Fill missing chunks#3748
williamsnell wants to merge 29 commits intozarr-developers:mainfrom
williamsnell:fill-missing-chunks

Conversation

@williamsnell
Copy link
Copy Markdown

@williamsnell williamsnell commented Mar 5, 2026

Add config options for whether a missing chunk should:

  • appear as a chunk filled with fill_value (current behaviour; retained as default)
  • raise a MissingChunkError

This PR is entirely based on the work of @tomwhite in this issue. I've started this PR as this an important feature that I'd like to see merged.
I've added a test (based on the demo in the issue) and a minor docs tweak.

Questions:

  • I've added an example to config.md - is this the right place?

TODO:

  • Add unit tests and/or doctests in docstrings
  • Add docstrings and API docs for any new/modified user-facing classes and functions
  • New/modified features documented in docs/user-guide/*.md
  • Changes documented as a new file in changes/
  • GitHub Actions have all passed
  • Test coverage is 100% (Codecov passes)

@github-actions github-actions bot added needs release notes Automatically applied to PRs which haven't added release notes and removed needs release notes Automatically applied to PRs which haven't added release notes labels Mar 5, 2026
@tomwhite
Copy link
Copy Markdown
Member

tomwhite commented Mar 5, 2026

Thanks for doing this @williamsnell!

@d-v-b
Copy link
Copy Markdown
Contributor

d-v-b commented Mar 5, 2026

this is great @williamsnell! I'm wondering if this be exposed as part of the general array configuration, which is where write_empty_chunks currently sits.

@williamsnell
Copy link
Copy Markdown
Author

@d-v-b I've pushed a new commit moving this to ArrayConfig so we can look into the ergonomics. It does feel like a more natural fit, especially since users probably want to set write_empty_chunks and fill_missing_chunks as a pair.

One note: I'm not sure how this interacts with Sharding now - following the existing code I hardcoded fill_missing_chunks=True here - does this imply there's no way to use fill_missing_chunks alongside sharding? If so, I can update the docs.

@d-v-b
Copy link
Copy Markdown
Contributor

d-v-b commented Mar 5, 2026

One note: I'm not sure how this interacts with Sharding now - following the existing code I hardcoded fill_missing_chunks=True here - does this imply there's no way to use fill_missing_chunks alongside sharding? If so, I can update the docs.

I don't think we want this new configuration option to change the behavior of the sharding codec. A missing subchunk inside a shard is conveyed explicitly via the shard index, so from the sharding codec's POV you can't have a subchunk appear missing due to a network error.

@williamsnell
Copy link
Copy Markdown
Author

williamsnell commented Mar 6, 2026

I don't think we want this new configuration option to change the behavior of the sharding codec. A missing subchunk inside a shard is conveyed explicitly via the shard index, so from the sharding codec's POV you can't have a subchunk appear missing due to a network error.

If I've understood correctly, we'll want to make this tweak to ShardingCodec._get_chunk_spec:

def _get_chunk_spec(self, shard_spec: ArraySpec) -> ArraySpec:
+  # Because the shard index and inner chunks should be stored
+  # together, we detect missing data via the shard index.
+  # The inner chunks defined here are thus allowed to return
+  # None, even if fill_missing_chunks=False at the array level.
+  config = replace(shard_spec.config, fill_missing_chunks=True)
   return ArraySpec(
       shape=self.chunk_shape,
       dtype=shard_spec.dtype,
       fill_value=shard_spec.fill_value,
-      config=shard_spec.config,
+      config=config,
       prototype=shard_spec.prototype,
   )

With this change, I think my previous point was wrong - we would be able to use fill_missing_chunk=False with sharding, and we would only raise an error if an entire shard (specifically its shard index) was missing.

@williamsnell williamsnell force-pushed the fill-missing-chunks branch from 6db55a1 to de7afd8 Compare March 6, 2026 20:49
@williamsnell
Copy link
Copy Markdown
Author

I've committed the change to _get_chunk_spec and rebased onto main.

I've also made two more changes:

  • Added some tests to check/codify expected behaviour around:
    • when a sharded array should or shouldn't raise MissingChunkError
    • how fill_missing_chunks is expected to interact with write_empty_chunks
  • Simplified the branching in codec_pipeline into if/elif/else.

@williamsnell
Copy link
Copy Markdown
Author

Is there a way to run coverage on this PR? I assumed it would pop up automatically, but apparently not.

@d-v-b
Copy link
Copy Markdown
Contributor

d-v-b commented Mar 11, 2026

a few suggestions about the error class, if it's OK with you @williamsnell I'd be happy pushing those changes to this branch?

@williamsnell
Copy link
Copy Markdown
Author

williamsnell commented Mar 11, 2026

a few suggestions about the error class, if it's OK with you @williamsnell I'd be happy pushing those changes to this branch?

Go for it @d-v-b - thank you!

@d-v-b d-v-b added the benchmark Code will be benchmarked in a CI job. label Mar 11, 2026
@d-v-b
Copy link
Copy Markdown
Contributor

d-v-b commented Mar 18, 2026

as long as the actual error class is exported!

yeah zarr.errors is public

@williamsnell
Copy link
Copy Markdown
Author

From the discussion above, I'm planning to do the following:

  1. integrate the changes from maxrjones@37a40e3
  2. rename fill_missing_chunks to read_missing_chunks

Are we all happy with this approach? Anything I've missed?

@d-v-b
Copy link
Copy Markdown
Contributor

d-v-b commented Mar 23, 2026

From the discussion above, I'm planning to do the following:

1. integrate the changes from [maxrjones@37a40e3](https://github.com/maxrjones/zarr-python/commit/37a40e37f4ec57da56629c3505fdd2265b8d7937)

2. rename `fill_missing_chunks` to `read_missing_chunks`

Are we all happy with this approach? Anything I've missed?

This seems like a good summary! I think max's approach doesn't require breaking any public APIs, but it does change the return type of CodecPipeline.read, which should probably be critically evaluated. We need read() to return something that conveys the missing chunks, but it might be worth checking if list[int] is the best return type for that purpose.

@d-v-b
Copy link
Copy Markdown
Contributor

d-v-b commented Mar 23, 2026

it might be worth checking if list[int] is the best return type for that purpose.

for example, tuple[ReadResult, ...], where ReadResult is something like this:

ReadResult = Literal["missing", "present"]

Clients interested in the indices of the missing values can find them, but this is also open to adding additional information in the future by widening the type of ReadResult.

@williamsnell
Copy link
Copy Markdown
Author

it might be worth checking if list[int] is the best return type for that purpose.

for example, tuple[ReadResult, ...], where ReadResult is something like this:

ReadResult = Literal["missing", "present"]

Clients interested in the indices of the missing values can find them, but this is also open to adding additional information in the future by widening the type of ReadResult.

Speaking as the obviously least qualified zarr contributor in this thread, this sounds good to me 😄.

Before I make this change: double-checking the exact type we want to implement here? e.g. Literal["missing", "present"] vs an enum vs Literal[0, 1] vs something more complex (a ReadResult dataclass with a missing member or something?). I don't imagine this is the most performance critical part of the codebase, but could imagine a world in which code that interacts with the read() interface could be quite opinionated about this type. So double-checking.

Otherwise, I'll merge maxrjones@37a40e3 into this branch, then modify it to return List[ReadResult], ReadResult = Literal["missing", "present"] as you've suggested @d-v-b.

@maxrjones
Copy link
Copy Markdown
Member

TBH I don't see value in widening the return type. While marginal compared to I/O, return a "present" status for every chunk adds memory and processing overhead. I think we should only do that if there's practical value from the information, rather than speculatively suggesting that the information could be needed later.

@maxrjones
Copy link
Copy Markdown
Member

I could see value in extensibility in the return type for missing chunk information. Something like:

@dataclass(frozen=True)
class MissingChunkInfo:
    key: str
    coordinates: tuple[int, ...]

With read() returning list[MissingChunkInfo]. This gives meaningful extensibility (e.g., a future reason field) without paying per-chunk overhead for present chunks. The caller could then build a comprehensive error message directly from the list without indirecting back through batch_info.

@d-v-b
Copy link
Copy Markdown
Contributor

d-v-b commented Mar 24, 2026

TBH I don't see value in widening the return type. While marginal compared to I/O, return a "present" status for every chunk adds memory and processing overhead. I think we should only do that if there's practical value from the information, rather than speculatively suggesting that the information could be needed later.

By returning a list of ints, the current design is not extensible without breaking downstream consumers. We can't predict how this might be used, so we should aim for a return type that can be extended.

I don't think the literal strings are terribly extensible, but a typeddict would be:

class ReadResult(TypedDict):
  missing: Literal[True, False]

This is cheap to construct and gives us as much flexibility as we could possibly need, provided we are OK sticking with a python primitive.

@d-v-b
Copy link
Copy Markdown
Contributor

d-v-b commented Mar 24, 2026

without paying per-chunk overhead for present chunks.

Compared to IO and compression, the overhead of bookkeeping our reads is negligible. That probably shouldn't weigh too heavily on our design here.

@maxrjones
Copy link
Copy Markdown
Member

Color me skeptical of returning info about all chunks rather than just missing ones, but I'd be glad to be proven wrong down the line. I support the solution from #3748 (comment) and won't block a dense tuple or set return type.

@d-v-b
Copy link
Copy Markdown
Contributor

d-v-b commented Mar 24, 2026

sorry for the distraction from the core of your PR @williamsnell, we are definitely going to get this merged one way or another. It's just that we are running into a few warts in our basic IO model, which is that so far we have not associated IO operations with any kind of annotations / metadata besides the literal payload when fetching data.

This means useful information, like the time required for the read / write, and any other information about the read / write operation the storage backend might know, is totally missing from our stack. Ideally we would have defined these semantics at the Store level, and then we could re-use them here, but we haven't gotten around to that yet :/

@maxrjones
Copy link
Copy Markdown
Member

💯 thank you for your patience and working on this @williamsnell 🙇 🙌

@williamsnell
Copy link
Copy Markdown
Author

I'm working on merging the changes from upstream/main and @maxrjones's branch.

Question: read_missing_chunks feels off to me, now that I'm propagating it throughout the docs. It doesn't clearly convey what reading a missing chunk does.

The only name I've found that feels sufficiently explanatory and keeps the read prefix is read_missing_chunks_as_fill_value. This is wordy but at least explicit.

Any strong preferences for any of the options (fill_missing_chunks, read_missing_chunks, or read_missing_chunks_as_fill_value)?

…37a40e3.

Update docstrings to match current behaviour. Move description of
sharding behaviour to test, now that it has no dedicated codepath.
@maxrjones
Copy link
Copy Markdown
Member

Question: read_missing_chunks feels off to me, now that I'm propagating it throughout the docs. It doesn't clearly convey what reading a missing chunk does.

The only name I've found that feels sufficiently explanatory and keeps the read prefix is read_missing_chunks_as_fill_value. This is wordy but at least explicit.

Any strong preferences for any of the options (fill_missing_chunks, read_missing_chunks, or read_missing_chunks_as_fill_value)?

yeah, I get that. what about raise_on_missing_chunks or error_on_missing_chunks? I think that most directly describes the feature that's been missing. The default would be False, in contrast to the other options which would default to True.

specifically out of the options you suggested, I find fill_missing_chunks the most intuitive.

@williamsnell
Copy link
Copy Markdown
Author

I've pushed the combined changes to this branch - thanks all!

Note: codec_pipeline.py is now identical to what's on main:

  • We always fill chunks with the array fill value.
  • We always return a GetResult of either "present" or "missing".

Only in array.py do we actually check the config for read_missing_chunks and decide whether to investigate the list[GetResult] and possibly raise an error.

@williamsnell
Copy link
Copy Markdown
Author

yeah, I get that. what about raise_on_missing_chunks or error_on_missing_chunks? I think that most directly describes the feature that's been missing. The default would be False, in contrast to the other options which would default to True.

specifically out of the options you suggested, I find fill_missing_chunks the most intuitive.

With the refactors I've just pushed, the most accurate description is definitely raise_on_missing_chunks or error_on_missing_chunks.

@d-v-b
Copy link
Copy Markdown
Contributor

d-v-b commented Mar 26, 2026

yeah, I get that. what about raise_on_missing_chunks or error_on_missing_chunks? I think that most directly describes the feature that's been missing. The default would be False, in contrast to the other options which would default to True.
specifically out of the options you suggested, I find fill_missing_chunks the most intuitive.

With the refactors I've just pushed, the most accurate description is definitely raise_on_missing_chunks or error_on_missing_chunks.

I think both of these names work, I will submit a third: allow_missing_chunks. But any one of these seems good to me.

@williamsnell
Copy link
Copy Markdown
Author

Ok, I've gone in circles myself enough with the naming and think all options have their merits. For the sake of keeping momentum, I'll leave the name as is (read_missing_chunks).

With that, I think this PR is ready - but if there's anything else needed, please let me know!

@ilan-gold
Copy link
Copy Markdown
Contributor

The only name I've found that feels sufficiently explanatory and keeps the read prefix is read_missing_chunks_as_fill_value. This is wordy but at least explicit

I think this is fine TBH. Just wanted to reiterate why I mentioned this in the first place (since it hasn't been mentioned): having write_empty_chunks + fill_missing_chunks to me is weird. What does fill_missing_chunks do if not the same thing as write_empty_chunks?

Very pedantic, and I appear to be in the minority.

@d-v-b
Copy link
Copy Markdown
Contributor

d-v-b commented Mar 27, 2026

i think as long as the name is not catastrophically bad we are fine :) we can always introduce a new config field in the future if we are for some reason deeply dissatisfied with the one here.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants