LSDB Operations Implementation by dougbrn · Pull Request #1359 · astronomy-commons/lsdb

dougbrn · 2026-05-04T19:24:44Z

Motivation

This PR implements a sizable rewrite of LSDB's backend relationship with Dask. Previously, LSDB was built directly on top of Dask DataFrame, and much of LSDB's functionality relied on Dask primitives to dictate graph construction (for the most part; in some cases we moved to custom delayed-style graphs for better control). With LSDB Operations, LSDB takes complete control of graph construction through its own custom set of Operations.

The driving motivation is that we have observed graph size and construction time to be a clear limiter on scalability. When relying on Dask for graph construction, we hit several bottlenecks and failure modes related to tasks storing too much information and pre-culled graphs being large. In principle, however, we know exactly what the optimized graph should look like for nearly all LSDB workflows. Workflows either involve spatial partition matching, where we know exactly which partitions should be interwoven, or they are simple map_partitions-style work. LSDB Operations are designed to arrive at this optimal graph directly.
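As a toy illustration of what "arriving at the graph directly" can mean, the sketch below writes a task graph as a plain dict in the key -> (callable, *args) style of Dask's classic graph specification, with exactly one crossmatch task per matched partition pair. The names `load`, `crossmatch_pair`, `build_graph`, `run`, and the pixel labels are all hypothetical and are not LSDB's actual Operation classes.

```python
def load(catalog_name, pixel):
    # Hypothetical loader: stands in for reading one HEALPix partition.
    return f"{catalog_name}[{pixel}]"


def crossmatch_pair(left_part, right_part):
    # Hypothetical per-pair work on two already-loaded partitions.
    return f"match({left_part}, {right_part})"


def build_graph(matched_pixels):
    """Build one load task per partition and one task per matched pair."""
    graph = {}
    for left_pix, right_pix in matched_pixels:
        graph[("left", left_pix)] = (load, "left", left_pix)
        graph[("right", right_pix)] = (load, "right", right_pix)
        graph[("xmatch", left_pix, right_pix)] = (
            crossmatch_pair,
            ("left", left_pix),
            ("right", right_pix),
        )
    return graph


def run(graph, key):
    """Tiny recursive executor: resolve arguments that are keys in the graph."""
    func, *args = graph[key]
    resolved = [
        run(graph, arg) if isinstance(arg, tuple) and arg in graph else arg
        for arg in args
    ]
    return func(*resolved)


graph = build_graph([("p1", "p1"), ("p1", "p2")])
print(len(graph), run(graph, ("xmatch", "p1", "p2")))
```

Because the partition pairing is known up front, the graph contains only the tasks that will actually run, so nothing needs to be culled afterwards.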

The exact results depend on the workflow, but as a rough estimate we expect a ~3-5x reduction in the number of graph tasks and in graph size in memory. This can also speed up the actual computation downstream, simply because fewer tasks means less of Dask's per-task overhead (~1ms).
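A back-of-the-envelope version of that estimate, using only the ~1ms per-task figure and the 3-5x range quoted above. The starting task count is a made-up number for illustration, and the linear model ignores everything except per-task overhead.

```python
PER_TASK_OVERHEAD_S = 1e-3  # ~1 ms per task, the rough figure quoted above


def scheduler_overhead_s(n_tasks, per_task_s=PER_TASK_OVERHEAD_S):
    """Total per-task overhead in seconds under a simple linear model."""
    return n_tasks * per_task_s


n_tasks_before = 1_000_000  # hypothetical graph size, for illustration only
for reduction in (3, 5):
    n_tasks_after = n_tasks_before // reduction
    saved = scheduler_overhead_s(n_tasks_before) - scheduler_overhead_s(n_tasks_after)
    print(f"{reduction}x fewer tasks: about {saved:.0f} s less overhead")
```

Under these assumptions, roughly 1000 s of overhead for a million tasks drops to somewhere between about 200 s and 333 s.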

API Changes

The major change is that the ._ddf property, which accessed the underlying Dask DataFrame directly, is no longer available, because there is no longer an underlying Dask DataFrame!

The public API is for the most part untouched, but there are exceptions:

Additions:

Adds an explicit Catalog.to_dask_dataframe method as a replacement for the loss of ._ddf
Adds Catalog.exploded_columns (previously only available through ._ddf)

Removals:

Catalog.merge is removed, as it was just a wrapper for Dask's merge; call Dask's merge directly via Catalog.to_dask_dataframe().merge
HealpixDataset.get_partition_index is removed
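A minimal migration sketch for code that previously called Catalog.merge. It relies only on the `to_dask_dataframe` method named above and on the `merge` method of the object it returns; the helper name `merge_via_dask` is hypothetical.

```python
def merge_via_dask(left_catalog, right_catalog, **merge_kwargs):
    """Merge two catalogs by dropping down to Dask explicitly.

    Assumes both arguments expose `to_dask_dataframe()`, as described in the
    Additions list. Keyword arguments (e.g. on=..., how=...) are passed
    straight through to the Dask DataFrame `merge`.
    """
    left_ddf = left_catalog.to_dask_dataframe()
    right_ddf = right_catalog.to_dask_dataframe()
    return left_ddf.merge(right_ddf, **merge_kwargs)
```

Note that the result is whatever Dask's merge returns, not an LSDB catalog, which matches the old wrapper being a thin pass-through.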

Behavior Changes:

map_partitions meta has been expanded to accept more input data types (series or dict), and coerces results more heavily into dataframes (which return as catalogs)
HealpixDataset.get_partition returns a HealpixDataset, not a ddf
.partitions supports healpix pixels as well as indices
to_delayed defaults to optimize=False instead of True
HealpixDataset.partitions is now iterable
HealpixDataset.sample no longer computes multiple times
HealpixDataset.random_sample no longer uses delayed
prune_empty_partitions no longer has a persist option
Column selection: Selecting a single column (e.g. catalog["my_column"]) now returns a catalog rather than a dd.Series; however, single-column syntax is still usable for mask-style filtering (e.g. catalog[catalog["my_column"] == True])
A nest accessor (for nested columns) is no longer available from a catalog column

Technical State

The unit test suite has been fully adapted to operations and is now all passing. The remaining failures are mostly mypy errors, some of which reflect issues already present in the codebase, surfaced because this PR introduces a pandas typing dev dependency that gives mypy typing context it previously lacked.

Docs have been reviewed (one pass) both for code changes (._ddf replacement) and language consistency with operations. Static notebooks were checked for obvious language/code changes, but have not been rerun.

The operations branch was tested during a focus week sprint, which is worth mentioning as evidence that it has been exercised outside the scope of our CI.

Future Considerations

The recently added LSDB Streaming implementation had some nice performance gains over the current LSDB (particularly by avoiding the large pre-culled graph sizes present in pre-operations LSDB), but potentially less so when compared against operations. While this is nice, we should look further into optimizing streaming for operations.
Query optimization: We have enough control over the graphs to be able to retroactively optimize our graph execution, for example when a workflow loads all columns but only works with one of them. We didn't want to include this in our initial implementation, but it is worth investigating in the near future.
Series Integration: We currently replicate some Series behavior, such as column selection, mask filtering, or map_partitions returning a scalar, using single-column dataframes. We may want to consider switching to either our own Operations-backed Series or integrating with Dask Series. Some exploration of this has been done on the sean/series branch.



delucchi-cmu

I'm only like 15% through looking at it, but this is enough to make me very happy:

There are just so many cases where the change cleans up existing interfaces. I'm sure there's sadness coming, so this is probably a nice place to leave it for today.

One major thing this PR needs is a longer description. We'll need to be clear about breaking changes, the motivation, what is NOT being implemented yet, etc. It's going to show up quite a bit in blame =D

delucchi-cmu · 2026-06-25T14:24:35Z

+
+
+def test_coerce_to_meta_unsupported_scalar_type_raises_type_error():
+    # KNOWN ISSUE: an object of a type pandas doesn't recognize as a dtype


So if the dataframe contains a row with HealpixPixel, we'll get a type error? I like to do that kinda thing A LOT in my homework pipelines, when there's some summary statistics I'm creating for each pixel. Or am I misreading this note?

This test is for _coerce_to_meta, which is used at the end of a map_partitions call to coerce the result structure into a dataframe. So if you return something with a HealpixPixel intended as a column, then that should be okay. But if you have a map_partitions call that returns a bare HealpixPixel, then that would fail and you would need to wrap it in a supported return type (like {"pixel_result": my_HealpixPixel}). Does that seem reasonable?

Could we coerce any unsupported dtype to an object column? I think that's what dask map_partitions would do currently and make it a series with a single element per partition, and I agree it's something I use currently

Gotcha, that seems like what we want. I've updated the method to fall back to the object dtype whenever casting the type to a pandas dtype fails.
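The fallback discussed in this thread can be sketched roughly as follows. This is an illustration only and not the actual `_coerce_to_meta` code: `dtype_or_object`, `scalar_to_frame`, and `ToyPixel` are hypothetical names, and the sketch assumes `pandas.api.types.pandas_dtype` raises `TypeError` for Python types it does not recognize.

```python
import numpy as np
import pandas as pd


def dtype_or_object(value):
    """Return a pandas-compatible dtype for `value`, falling back to object."""
    try:
        return pd.api.types.pandas_dtype(type(value))
    except TypeError:
        # pandas does not recognize this Python type as a dtype
        return np.dtype(object)


def scalar_to_frame(value, column="result"):
    """Wrap one per-partition scalar in a single-row, single-column DataFrame."""
    return pd.DataFrame({column: pd.Series([value], dtype=dtype_or_object(value))})


class ToyPixel:
    """Hypothetical stand-in for a HealpixPixel-like object."""

    def __init__(self, order, pixel):
        self.order = order
        self.pixel = pixel


frame = scalar_to_frame(ToyPixel(1, 44))
print(frame.dtypes)
```

With a fallback like this, a function that returns an arbitrary Python object per partition yields an object-typed column instead of raising.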


smcguire-cmu · 2026-06-30T20:35:52Z

+    @overload
+    def map_partitions(
+        self,
+        func: Callable[..., pd.DataFrame],
+        *args: Any,
+        meta: pd.DataFrame | pd.Series | dict | Iterable | tuple | None = None,
+        include_pixel: bool = False,
+        compute_single_partition: bool = False,
+        partition_index: int | HealpixPixel | None = None,
+        **kwargs: Any,
+    ) -> Self: ...
+    @overload
+    def map_partitions(
+        self,
+        func: Callable[..., pd.Series],
+        *args: Any,
+        meta: pd.DataFrame | pd.Series | dict | Iterable | tuple | None = None,
+        include_pixel: bool = False,
+        compute_single_partition: bool = False,
+        partition_index: int | HealpixPixel | None = None,
+        **kwargs: Any,
+    ) -> Self | dd.Series: ...
+    def map_partitions(
+        self,
+        func: Callable[..., npd.NestedFrame],
+        *args,
+        meta: pd.DataFrame | pd.Series | dict | Iterable | tuple | None = None,
+        include_pixel: bool = False,
+        compute_single_partition: bool = False,
+        partition_index: int | HealpixPixel | None = None,
+        **kwargs,
+    ) -> Self | dd.Series:


We don't actually ever return a dd.Series anymore, right? If it's always a catalog we can remove the overloads and change the type hints to remove the types within the Callable from func and remove the dd.Series return type.

smcguire-cmu · 2026-06-30T20:36:26Z

+        Self or dd.Series
+            A new catalog with each partition replaced with the output of the function applied to the original
+            partition. If the function returns a non dataframe output, a dask Series will be returned.


Same return type issue as above

smcguire-cmu · 2026-06-30T20:45:49Z

+        if not isinstance(new_op.meta, pd.DataFrame):
+            warnings.warn(
+                "output of the function must be a DataFrame to generate an LSDB `Catalog`. "
+                "`map_partitions` will return a dask object instead of a Catalog.",
+                RuntimeWarning,
+            )
+            return new_cat.to_dask_dataframe()


How does this work with our meta inferencing logic? Is this reachable?

smcguire-cmu · 2026-06-30T20:46:47Z

+        self, op_class: type[MapPartitions], func, *args, meta=None, **kwargs
+    ) -> Self:
+        new_op = op_class(self._operation, func, *args, meta=meta, **kwargs)
+        return self._create_updated_dataset(op=new_op)


This is used in get_item but not updated in the Catalog subclass, so the margin will never be updated. We should override this with margin logic in the subclass, and potentially change map_partitions to use it so we don't have to overwrite map_partitions in Catalog.
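The override pattern suggested here might look something like the toy below. The classes `ToyDataset` and `ToyCatalog` and the method `_apply_op` are hypothetical stand-ins, not the real HealpixDataset or Catalog; operations are recorded as plain strings just to show that the margin receives the same operation as the main catalog.

```python
class ToyDataset:
    """Hypothetical stand-in for a base dataset: records applied operations."""

    def __init__(self, ops=()):
        self.ops = tuple(ops)

    def _apply_op(self, op_name):
        # Base behavior: only this dataset is updated.
        return ToyDataset(self.ops + (op_name,))

    def select_column(self, column):
        # Every public method funnels through the single _apply_op hook.
        return self._apply_op(f"select:{column}")


class ToyCatalog(ToyDataset):
    """Hypothetical stand-in for a catalog that carries a margin dataset."""

    def __init__(self, ops=(), margin=None):
        super().__init__(ops)
        self.margin = margin

    def _apply_op(self, op_name):
        # Override: apply the same operation to the margin, if there is one.
        new_margin = None
        if self.margin is not None:
            new_margin = self.margin._apply_op(op_name)
        return ToyCatalog(self.ops + (op_name,), margin=new_margin)


catalog = ToyCatalog(margin=ToyDataset())
selected = catalog.select_column("ra")
print(selected.ops, selected.margin.ops)
```

Routing every method through one overridable hook means the subclass only has to add margin handling in a single place, which is the motivation for not overwriting map_partitions separately.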

smcguire-cmu · 2026-07-01T21:57:50Z

            if new_dec_col != dec_col:
                updated_params["dec_column"] = new_dec_col
-        return self._create_updated_dataset(ddf=ndf, updated_catalog_info_params=updated_params)
+        new_op = MapPartitions(self._operation, lambda df: df.rename(columns=columns))


This should use the self.map_partitions so the margin also gets updated in the Catalog

smcguire-cmu · 2026-07-01T22:09:47Z

-        if len(self.dataframe.nested_columns) > 0:
-            ddf = ddf.astype({col: self.dataframe[col].dtype for col in self.dataframe.nested_columns})


Is this no longer needed?


github-actions · 2026-07-01T22:47:45Z

| Before [`4e1fdfc`] | After [`605740d`] | Ratio | Benchmark (Parameter) |
|---|---|---|---|
| 8.71±0.06s | 3.85±0.01s | ~0.44 | benchmarks.time_lazy_crossmatch_many_columns_overlapping_suffixes |
| 9.56±0.1s | 9.02±0.04s | 0.94 | benchmarks.time_save_big_catalog |
| 159±4ms | 125±1ms | 0.79 | benchmarks.time_open_many_columns_list |
| 44.1±1ms | 31.5±0.7ms | 0.71 | benchmarks.time_polygon_search |
| 340±6ms | 232±2ms | 0.68 | benchmarks.time_open_many_columns_default |
| 26.1±0.9ms | 14.1±0.2ms | 0.54 | benchmarks.time_box_filter_on_partition |
| 3.09±0.02s | 1.51±0.01s | 0.49 | benchmarks.time_open_many_columns_all |
| 8.48±0.02s | 3.83±0.01s | 0.45 | benchmarks.time_lazy_crossmatch_many_columns_all_suffixes |
| 162±2ms | 68.0±0.5ms | 0.42 | benchmarks.time_kdtree_crossmatch |
| 1.01±0.01s | 170±1ms | 0.17 | benchmarks.time_create_midsize_catalog |


smcguire-cmu and others added 29 commits June 10, 2026 09:20

Start changing ddf to Operation

a5f9d7a

Finish changing healpix_dataset to use operation

3bc19fa

remove bogus import

210bb5d

fix map_rows op

c8c7263

WIP catalog & tests

f803d18

wip sync

6dbbc27

Change merge functions to use operations (#1355)

6700d4b

Update from_dataframe to use Operations (#1358)

799f724

wip

0ef061f

more tests passing

b639fac

more meta coercion

7384e1a

Fix get_arrow_schema usage

c29e08a

fix alignandapply

af4163e

map_partitions meta normalization

9049f2d

more unit test fixes

0934740

fix key name overlap bug

3c22ed2

add func to tokenize

6246883

minor

be3c830

update docstrings

200ff63

remove nd.NestedFrame references outside of lsdb.nested

98a6af9

eject lsdb.nested; migrate generation to catalog sub-module

84f79cd

nested tests passing

9d0ac04

remove merge

983f82a

operations-friendly streaming

4b6cf1d

Progress on unit tests (#1396)

c15cea0

* wip * fix search and crossmatch unit tests * fix join

fix loader tests

eb31f21

Sean/ops tests (#1405)

9f52986

* fix est size * fix other unit tests

Add verify meta to lsdb_ops (#1407)

6cf2a7d

* add verify meta to ops * improve how verify_meta works * fix failing tests

add boolean catalog filter logic (#1414)

391dd32

* add boolean catalog filter logic * add partitioning test

dougbrn added 2 commits June 18, 2026 13:50

add pandas-stubs for mypy

ffc4c43

rewrite intro to reflect new dask relationship

38c2127

dougbrn changed the title ~~[DO NOT MERGE] LSDB Operations Implementation~~ LSDB Operations Implementation Jun 24, 2026

dougbrn marked this pull request as ready for review June 24, 2026 16:51

dougbrn requested a review from delucchi-cmu June 24, 2026 16:52

delucchi-cmu reviewed Jun 24, 2026

Comment thread src/lsdb/catalog/dataset/healpix_dataset.py Outdated

Comment thread src/lsdb/catalog/dataset/healpix_dataset.py

Comment thread src/lsdb/catalog/dataset/healpix_dataset.py

Comment thread src/lsdb/operations/lsdb_ops.py

Comment thread src/lsdb/operations/lsdb_ops.py

delucchi-cmu reviewed Jun 25, 2026

dougbrn and others added 7 commits June 25, 2026 14:35

Update src/lsdb/loaders/dataframe/margin_catalog_generator.py

c35beef

Co-authored-by: Melissa DeLucchi <113376043+delucchi-cmu@users.noreply.github.com>

Update src/lsdb/operations/functions/concat_catalog_data.py

8813157

Co-authored-by: Melissa DeLucchi <113376043+delucchi-cmu@users.noreply.github.com>

Update src/lsdb/operations/lsdb_ops.py

9d6f9de

Co-authored-by: Melissa DeLucchi <113376043+delucchi-cmu@users.noreply.github.com>

Update src/lsdb/operations/lsdb_ops.py

a6c3a4e

Co-authored-by: Melissa DeLucchi <113376043+delucchi-cmu@users.noreply.github.com>

address review comments

62cbf24

add comments on filtering and comparisons

4caa6bf

Merge branch 'main' into operations

d42843c

nevencaplar reviewed Jun 26, 2026

Comment thread src/lsdb/loaders/hats/read_hats.py Outdated

remove dupes

9d13fa1

smcguire-cmu mentioned this pull request Jun 29, 2026

Consider replacing schema with a derived property #1443

Open

dougbrn mentioned this pull request Jun 29, 2026

Investigate Streaming Graph Construction #1444

Open

fallback to object types

8045b42

dougbrn requested a review from smcguire-cmu June 30, 2026 22:01

dougbrn mentioned this pull request Jul 1, 2026

[DO NOT MERGE] Operations-optimal Streaming #1450

Draft

smcguire-cmu reviewed Jul 1, 2026

dougbrn mentioned this pull request Jul 1, 2026

Improve operations-backed catalog repr #1456

Open

dougbrn and others added 2 commits July 1, 2026 15:24

Update src/lsdb/catalog/dataset/healpix_dataset.py

69dc6ab

Co-authored-by: Sean McGuire <123987820+smcguire-cmu@users.noreply.github.com>

properly check meta not catalog

3aa2f89


