Fill missing chunks by williamsnell · Pull Request #3748 · zarr-developers/zarr-python

williamsnell · 2026-03-05T05:02:32Z

Add config options for whether a missing chunk should:

appear as a chunk filled with fill_value (current behaviour; retained as default)
raise a MissingChunkError
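
A minimal sketch of the two behaviours, for illustration only. It uses a plain dict as a stand-in for a store, and `read_chunk` is a hypothetical helper rather than anything in zarr-python; only the `fill_missing_chunks` option name and `MissingChunkError` come from this PR.

```python
class MissingChunkError(KeyError):
    """Raised when a chunk is absent from the store (name taken from this PR)."""


def read_chunk(store, key, chunk_size, fill_value, fill_missing_chunks=True):
    """Return the chunk stored under `key`, or handle its absence.

    `store` is a plain dict standing in for a real store (an assumption).
    """
    if key in store:
        return list(store[key])
    if fill_missing_chunks:
        # Default behaviour: a missing chunk reads back as fill_value.
        return [fill_value] * chunk_size
    # Opt-in behaviour: surface the missing chunk to the caller.
    raise MissingChunkError(key)


store = {"c/0": [1, 2, 3]}  # chunk "c/1" was never written, or was lost

filled = read_chunk(store, "c/1", chunk_size=3, fill_value=0)

try:
    read_chunk(store, "c/1", chunk_size=3, fill_value=0, fill_missing_chunks=False)
    raised = False
except MissingChunkError:
    raised = True
```

With the default, `filled` is a chunk of zeros; with `fill_missing_chunks=False`, the same read raises instead.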

This PR is entirely based on the work of @tomwhite in this issue. I've started this PR as this is an important feature that I'd like to see merged.
I've added a test (based on the demo in the issue) and a minor docs tweak.

Questions:

I've added an example to config.md - is this the right place?

TODO:

Add unit tests and/or doctests in docstrings
Add docstrings and API docs for any new/modified user-facing classes and functions
New/modified features documented in docs/user-guide/*.md
Changes documented as a new file in changes/
GitHub Actions have all passed
Test coverage is 100% (Codecov passes)

tomwhite · 2026-03-05T11:14:33Z

Thanks for doing this @williamsnell!

d-v-b · 2026-03-05T12:16:19Z

this is great @williamsnell! I'm wondering if this should be exposed as part of the general array configuration, which is where write_empty_chunks currently sits.

williamsnell · 2026-03-05T20:23:36Z

@d-v-b I've pushed a new commit moving this to ArrayConfig so we can look into the ergonomics. It does feel like a more natural fit, especially since users probably want to set write_empty_chunks and fill_missing_chunks as a pair.

One note: I'm not sure how this interacts with Sharding now - following the existing code I hardcoded fill_missing_chunks=True here - does this imply there's no way to use fill_missing_chunks alongside sharding? If so, I can update the docs.

d-v-b · 2026-03-05T20:29:51Z

> One note: I'm not sure how this interacts with Sharding now - following the existing code I hardcoded fill_missing_chunks=True here - does this imply there's no way to use fill_missing_chunks alongside sharding? If so, I can update the docs.

I don't think we want this new configuration option to change the behavior of the sharding codec. A missing subchunk inside a shard is conveyed explicitly via the shard index, so from the sharding codec's POV you can't have a subchunk appear missing due to a network error.

williamsnell · 2026-03-06T07:45:53Z

> I don't think we want this new configuration option to change the behavior of the sharding codec. A missing subchunk inside a shard is conveyed explicitly via the shard index, so from the sharding codec's POV you can't have a subchunk appear missing due to a network error.

If I've understood correctly, we'll want to make this tweak to ShardingCodec._get_chunk_spec:

def _get_chunk_spec(self, shard_spec: ArraySpec) -> ArraySpec:
+  # Because the shard index and inner chunks should be stored
+  # together, we detect missing data via the shard index.
+  # The inner chunks defined here are thus allowed to return
+  # None, even if fill_missing_chunks=False at the array level.
+  config = replace(shard_spec.config, fill_missing_chunks=True)
   return ArraySpec(
       shape=self.chunk_shape,
       dtype=shard_spec.dtype,
       fill_value=shard_spec.fill_value,
-      config=shard_spec.config,
+      config=config,
       prototype=shard_spec.prototype,
   )

With this change, I think my previous point was wrong - we would be able to use fill_missing_chunks=False with sharding, and we would only raise an error if an entire shard (specifically its shard index) was missing.
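
To make the intended behaviour concrete, here is a toy model (not the real ShardingCodec). A shard is modelled as an `(index, payload)` pair in a dict-based store, and `read_inner_chunk` and the `EMPTY` sentinel are hypothetical names for this sketch. As far as I know the sharding spec marks an empty inner chunk with an all-ones offset/nbytes pair, which is what `EMPTY` imitates.

```python
# Sentinel for "this inner chunk was never written" (assumed to mirror the spec).
EMPTY = (2**64 - 1, 2**64 - 1)


class MissingChunkError(KeyError):
    """Raised when a whole shard, and so its index, is absent."""


def read_inner_chunk(store, shard_key, inner, chunk_len, fill_value,
                     fill_missing_chunks=True):
    shard = store.get(shard_key)
    if shard is None:
        # The shard object itself is gone, so there is no index to consult.
        if fill_missing_chunks:
            return [fill_value] * chunk_len
        raise MissingChunkError(shard_key)
    index, payload = shard
    offset, nbytes = index[inner]
    if (offset, nbytes) == EMPTY:
        # The index explicitly records an empty inner chunk: always fill,
        # regardless of fill_missing_chunks at the array level.
        return [fill_value] * chunk_len
    return list(payload[offset:offset + nbytes])


# One shard holding two inner chunks; only inner chunk 0 has data.
store = {"c/0": ({0: (0, 2), 1: EMPTY}, bytes([7, 8]))}

present = read_inner_chunk(store, "c/0", 0, 2, 0, fill_missing_chunks=False)
empty = read_inner_chunk(store, "c/0", 1, 2, 0, fill_missing_chunks=False)
try:
    read_inner_chunk(store, "c/1", 0, 2, 0, fill_missing_chunks=False)
    shard_missing_raised = False
except MissingChunkError:
    shard_missing_raised = True
```

Only the read against the absent shard `"c/1"` raises; the empty inner chunk inside an existing shard still reads as fill values.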

williamsnell · 2026-03-06T20:59:42Z

I've committed the change to _get_chunk_spec and rebased onto main.

I've also made two more changes:

Added some tests to check/codify expected behaviour around:
- when a sharded array should or shouldn't raise MissingChunkError
- how fill_missing_chunks is expected to interact with write_empty_chunks
Simplified the branching in codec_pipeline into if/elif/else.

williamsnell · 2026-03-11T02:50:28Z

Is there a way to run coverage on this PR? I assumed it would pop up automatically, but apparently not.

d-v-b · 2026-03-11T08:24:16Z

a few suggestions about the error class, if it's OK with you @williamsnell I'd be happy pushing those changes to this branch?

williamsnell · 2026-03-11T09:05:55Z

> a few suggestions about the error class, if it's OK with you @williamsnell I'd be happy pushing those changes to this branch?

Go for it @d-v-b - thank you!

d-v-b · 2026-03-18T08:36:57Z

> as long as the actual error class is exported!

yeah zarr.errors is public

williamsnell · 2026-03-22T19:36:27Z

From the discussion above, I'm planning to do the following:

integrate the changes from maxrjones@37a40e3
rename fill_missing_chunks to read_missing_chunks

Are we all happy with this approach? Anything I've missed?

d-v-b · 2026-03-23T08:27:52Z

> From the discussion above, I'm planning to do the following:
>
> 1. integrate the changes from [maxrjones@37a40e3](https://github.com/maxrjones/zarr-python/commit/37a40e37f4ec57da56629c3505fdd2265b8d7937)
> 2. rename `fill_missing_chunks` to `read_missing_chunks`
>
> Are we all happy with this approach? Anything I've missed?

This seems like a good summary! I think max's approach doesn't require breaking any public APIs, but it does change the return type of CodecPipeline.read, which should probably be critically evaluated. We need read() to return something that conveys the missing chunks, but it might be worth checking if list[int] is the best return type for that purpose.

d-v-b · 2026-03-23T08:36:25Z

> it might be worth checking if list[int] is the best return type for that purpose.

for example, tuple[ReadResult, ...], where ReadResult is something like this:

ReadResult = Literal["missing", "present"]

Clients interested in the indices of the missing values can find them, but this is also open to adding additional information in the future by widening the type of ReadResult.
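
A rough sketch of what that could look like from a client's side. `missing_indices` is a hypothetical helper, not part of any existing API, and the `results` tuple is made-up data.

```python
from typing import Literal

ReadResult = Literal["missing", "present"]


def missing_indices(results: tuple[ReadResult, ...]) -> list[int]:
    """Recover the positions of missing chunks from a dense result tuple."""
    return [i for i, result in enumerate(results) if result == "missing"]


# One entry per requested chunk, in request order.
results: tuple[ReadResult, ...] = ("present", "missing", "present", "missing")
indices = missing_indices(results)
```

So the `list[int]` form is still recoverable from the dense tuple when a caller wants it.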

williamsnell · 2026-03-24T04:52:59Z

> > it might be worth checking if list[int] is the best return type for that purpose.
>
> for example, tuple[ReadResult, ...], where ReadResult is something like this:
>
> ReadResult = Literal["missing", "present"]
>
> Clients interested in the indices of the missing values can find them, but this is also open to adding additional information in the future by widening the type of ReadResult.

Speaking as the obviously least qualified zarr contributor in this thread, this sounds good to me 😄.

Before I make this change: double-checking the exact type we want to implement here? e.g. Literal["missing", "present"] vs an enum vs Literal[0, 1] vs something more complex (a ReadResult dataclass with a missing member or something?). I don't imagine this is the most performance critical part of the codebase, but could imagine a world in which code that interacts with the read() interface could be quite opinionated about this type. So double-checking.

Otherwise, I'll merge maxrjones@37a40e3 into this branch, then modify it to return List[ReadResult], ReadResult = Literal["missing", "present"] as you've suggested @d-v-b.

maxrjones · 2026-03-24T14:44:45Z

TBH I don't see value in widening the return type. While marginal compared to I/O, returning a "present" status for every chunk adds memory and processing overhead. I think we should only do that if there's practical value from the information, rather than speculatively suggesting that the information could be needed later.

maxrjones · 2026-03-24T14:47:48Z

I could see value in extensibility in the return type for missing chunk information. Something like:

@dataclass(frozen=True)
class MissingChunkInfo:
    key: str
    coordinates: tuple[int, ...]

With read() returning list[MissingChunkInfo]. This gives meaningful extensibility (e.g., a future reason field) without paying per-chunk overhead for present chunks. The caller could then build a comprehensive error message directly from the list without indirecting back through batch_info.
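
For illustration, a sketch of how a caller might turn that list into an error message. `describe_missing` is a hypothetical helper and the chunk keys and coordinates are made-up examples.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MissingChunkInfo:
    key: str
    coordinates: tuple[int, ...]


def describe_missing(missing: list[MissingChunkInfo]) -> str:
    """Build a human-readable summary directly from the returned list."""
    if not missing:
        return "no missing chunks"
    details = ", ".join(f"{m.key} at {m.coordinates}" for m in missing)
    return f"{len(missing)} missing chunk(s): {details}"


# Only missing chunks appear in the list; present chunks cost nothing.
missing = [
    MissingChunkInfo(key="c/0/1", coordinates=(0, 1)),
    MissingChunkInfo(key="c/2/0", coordinates=(2, 0)),
]
message = describe_missing(missing)
```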

d-v-b · 2026-03-24T14:48:36Z

> TBH I don't see value in widening the return type. While marginal compared to I/O, returning a "present" status for every chunk adds memory and processing overhead. I think we should only do that if there's practical value from the information, rather than speculatively suggesting that the information could be needed later.

By returning a list of ints, the current design is not extensible without breaking downstream consumers. We can't predict how this might be used, so we should aim for a return type that can be extended.

I don't think the literal strings are terribly extensible, but a typeddict would be:

class ReadResult(TypedDict):
  missing: Literal[True, False]

This is cheap to construct and gives us as much flexibility as we could possibly need, provided we are OK sticking with a python primitive.
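
A small sketch of the extensibility argument, under assumptions: `TimedReadResult`, its `elapsed_seconds` field, and `count_missing` are hypothetical names invented for this example. The point is that a consumer written against the original shape keeps working after an optional field is added.

```python
from typing import TypedDict


class ReadResult(TypedDict):
    missing: bool


class TimedReadResult(ReadResult, total=False):
    # Hypothetical optional field added later; existing consumers ignore it.
    elapsed_seconds: float


def count_missing(results) -> int:
    """A consumer that only knows about the original `missing` key."""
    return sum(1 for result in results if result["missing"])


old_style: list[ReadResult] = [{"missing": False}, {"missing": True}]
new_style: list[TimedReadResult] = [
    {"missing": False, "elapsed_seconds": 0.01},
    {"missing": True},
]

old_count = count_missing(old_style)
new_count = count_missing(new_style)
```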

d-v-b · 2026-03-24T14:50:17Z

> without paying per-chunk overhead for present chunks.

Compared to IO and compression, the overhead of bookkeeping our reads is negligible. That probably shouldn't weigh too heavily on our design here.

maxrjones · 2026-03-24T15:01:19Z

Color me skeptical of returning info about all chunks rather than just missing ones, but I'd be glad to be proven wrong down the line. I support the solution from #3748 (comment) and won't block a dense tuple or set return type.

d-v-b · 2026-03-24T15:04:20Z

sorry for the distraction from the core of your PR @williamsnell, we are definitely going to get this merged one way or another. It's just that we are running into a few warts in our basic IO model, which is that so far we have not associated IO operations with any kind of annotations / metadata besides the literal payload when fetching data.

This means useful information, like the time required for the read / write, and any other information about the read / write operation the storage backend might know, is totally missing from our stack. Ideally we would have defined these semantics at the Store level, and then we could re-use them here, but we haven't gotten around to that yet :/

maxrjones · 2026-03-24T15:33:42Z

💯 thank you for your patience and working on this @williamsnell 🙇 🙌

williamsnell · 2026-03-26T02:38:44Z

I'm working on merging the changes from upstream/main and @maxrjones's branch.

Question: read_missing_chunks feels off to me, now that I'm propagating it throughout the docs. It doesn't clearly convey what reading a missing chunk does.

The only name I've found that feels sufficiently explanatory and keeps the read prefix is read_missing_chunks_as_fill_value. This is wordy but at least explicit.

Any strong preferences for any of the options (fill_missing_chunks, read_missing_chunks, or read_missing_chunks_as_fill_value)?

maxrjones · 2026-03-26T04:05:01Z

> Question: read_missing_chunks feels off to me, now that I'm propagating it throughout the docs. It doesn't clearly convey what reading a missing chunk does.
>
> The only name I've found that feels sufficiently explanatory and keeps the read prefix is read_missing_chunks_as_fill_value. This is wordy but at least explicit.
>
> Any strong preferences for any of the options (fill_missing_chunks, read_missing_chunks, or read_missing_chunks_as_fill_value)?

yeah, I get that. what about raise_on_missing_chunks or error_on_missing_chunks? I think that most directly describes the feature that's been missing. The default would be False, in contrast to the other options which would default to True.

specifically out of the options you suggested, I find fill_missing_chunks the most intuitive.

williamsnell · 2026-03-26T04:05:55Z

I've pushed the combined changes to this branch - thanks all!

Note: codec_pipeline.py is now identical to what's on main:

We always fill chunks with the array fill value.
We always return a GetResult of either "present" or "missing".

Only in array.py do we actually check the config for read_missing_chunks and decide whether to investigate the list[GetResult] and possibly raise an error.
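
Roughly, the split of responsibilities looks like the sketch below. This is a simplified model, not the actual code in array.py: `check_read` is a hypothetical helper, and `GetResult` is modelled as a plain string literal, which may not match the real type on main.

```python
from typing import Literal

# Assumption: the real GetResult may carry more than a bare status string.
GetResult = Literal["present", "missing"]


class ChunkNotFoundError(KeyError):
    """Stand-in for the error class discussed in this PR."""


def check_read(results: list[GetResult], chunk_keys: list[str],
               read_missing_chunks: bool) -> None:
    """Array-level check; the pipeline has already filled missing chunks."""
    if read_missing_chunks:
        # Default: missing chunks were filled with fill_value; nothing to do.
        return
    missing = [key for key, result in zip(chunk_keys, results)
               if result == "missing"]
    if missing:
        raise ChunkNotFoundError(f"missing chunks: {missing}")


results: list[GetResult] = ["present", "missing"]
keys = ["c/0", "c/1"]

check_read(results, keys, read_missing_chunks=True)  # returns silently

try:
    check_read(results, keys, read_missing_chunks=False)
    raised = False
except ChunkNotFoundError:
    raised = True
```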

williamsnell · 2026-03-26T04:09:32Z

> yeah, I get that. what about raise_on_missing_chunks or error_on_missing_chunks? I think that most directly describes the feature that's been missing. The default would be False, in contrast to the other options which would default to True.
>
> specifically out of the options you suggested, I find fill_missing_chunks the most intuitive.

With the refactors I've just pushed, the most accurate description is definitely raise_on_missing_chunks or error_on_missing_chunks.

d-v-b · 2026-03-26T20:12:47Z

> > yeah, I get that. what about raise_on_missing_chunks or error_on_missing_chunks? I think that most directly describes the feature that's been missing. The default would be False, in contrast to the other options which would default to True.
> >
> > specifically out of the options you suggested, I find fill_missing_chunks the most intuitive.
>
> With the refactors I've just pushed, the most accurate description is definitely raise_on_missing_chunks or error_on_missing_chunks.

I think both of these names work, I will submit a third: allow_missing_chunks. But any one of these seems good to me.

williamsnell · 2026-03-27T03:10:31Z

Ok, I've gone in circles myself enough with the naming and think all options have their merits. For the sake of keeping momentum, I'll leave the name as is (read_missing_chunks).

With that, I think this PR is ready - but if there's anything else needed, please let me know!

ilan-gold · 2026-03-27T08:27:46Z

> The only name I've found that feels sufficiently explanatory and keeps the read prefix is read_missing_chunks_as_fill_value. This is wordy but at least explicit

I think this is fine TBH. Just wanted to reiterate why I mentioned this in the first place (since it hasn't been mentioned): having write_empty_chunks + fill_missing_chunks to me is weird. What does fill_missing_chunks do if not the same thing as write_empty_chunks?

Very pedantic, and I appear to be in the minority.

d-v-b · 2026-03-27T09:08:13Z

i think as long as the name is not catastrophically bad we are fine :) we can always introduce a new config field in the future if we are for some reason deeply dissatisfied with the one here.

github-actions bot added and removed the needs release notes label Mar 5, 2026

tomwhite and others added 8 commits March 7, 2026 09:48

Add codec_pipeline.fill_missing_chunks config

a07764e

Set default for fill_missing_chunks in config.py. Add test replicating example in zarr-python zarr-developers#486.

7438a03

Add fill_missing_chunks to examples of config options.

38e5acf

Add to /changes

1d51b37

Parameterize tests to make sure we hit both branches of `if self.supports_partial_decode`.

ad3e2ed

Fix lint errors: remove parentheses, type kwargs.

2c9b31b

Move config from codec_pipeline -> array. Update docs, tests.

2846ed9

Delegate missing-shard detection away from _get_chunk_spec. Codify expected behaviour of fill_missing_chunks for both sharding and write_empty_chunks via tests. Use elif to make control flow slightly clearer.

de7afd8

williamsnell force-pushed the fill-missing-chunks branch from 6db55a1 to de7afd8 Compare March 6, 2026 20:49

Merge branch 'main' into fill-missing-chunks

233ddce

d-v-b reviewed Mar 11, 2026: src/zarr/errors.py (outdated)

d-v-b reviewed Mar 11, 2026: src/zarr/errors.py (outdated)

d-v-b reviewed Mar 11, 2026: src/zarr/core/codec_pipeline.py (outdated)

Merge branch 'main' into fill-missing-chunks

c8b0d11

d-v-b added 4 commits March 11, 2026 12:51

Define ChunkNotFoundError; expose chunk key and chunk index in ChunkNotFoundError

d460521

update docs

9c9a096

Merge branch 'fill-missing-chunks' of https://github.com/williamsnell/zarr-python into fill-missing-chunks

40da713

Merge branch 'main' into fill-missing-chunks

16517d1

d-v-b added the benchmark Code will be benchmarked in a CI job. label Mar 11, 2026

Merge branch 'main' into fill-missing-chunks

404e4ac

d-v-b mentioned this pull request Mar 24, 2026

Return a result from IO operations #3827

Open

d-v-b mentioned this pull request Mar 24, 2026

feat: return a useful value from CodecPipeline.read() #3828

Merged

maxrjones and others added 2 commits March 26, 2026 15:11

Pass chunk indexes up

093c7f4

Merge branch 'main' into fill-missing-chunks

4d8cfe7

williamsnell added 2 commits March 26, 2026 15:41

fill_missing_chunks -> read_missing_chunks

45a68e6

Resolve behavioural differences between main and maxrjones/zarr-python@37a40e3. Update docstrings to match current behaviour. Move description of sharding behaviour to test, now that it has no dedicated codepath.

c578231

Merge branch 'main' into fill-missing-chunks

f01b049

Merge branch 'main' into fill-missing-chunks

4d7d0d2

Merge branch 'main' into fill-missing-chunks

b9e4206
