[SPARK-56123][PYTHON] Refactor SQL_GROUPED_AGG_ARROW_UDF and SQL_GROUPED_AGG_ARROW_ITER_UDF#54992

Closed
Yicong-Huang wants to merge 2 commits into apache:master from Yicong-Huang:SPARK-56123/refactor/grouped-agg-arrow-udf

@Yicong-Huang Yicong-Huang commented Mar 24, 2026

What changes were proposed in this pull request?

Refactor SQL_GROUPED_AGG_ARROW_UDF and SQL_GROUPED_AGG_ARROW_ITER_UDF to use ArrowStreamSerializer as a pure I/O layer, moving all processing logic into read_udfs() in worker.py.
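The layering described above can be illustrated with a minimal, standard-library-only sketch. This is not the PySpark code from this PR: `PureIOSerializer`, `read_udfs_sketch`, and the list-based `Batch` type are hypothetical stand-ins, and the real `ArrowStreamSerializer` operates on Arrow record batches. The sketch only shows the idea that the serializer moves data unchanged while the per-group aggregation logic lives in the worker-side function.

```python
from typing import Callable, Iterable, Iterator, List, Tuple

# Hypothetical stand-ins: a "batch" is a list of (key, value) rows,
# and each group arrives as a list of batches.
Batch = List[Tuple[str, float]]
Group = List[Batch]


class PureIOSerializer:
    """Hypothetical serializer that only moves data; it holds no UDF logic."""

    def load_stream(self, stream: Iterable[Group]) -> Iterator[Group]:
        # Yield each group's batches unchanged.
        for group in stream:
            yield group

    def dump_stream(self, results: Iterable[float]) -> List[float]:
        # Collect results as-is; a real serializer would write them to a socket.
        return list(results)


def read_udfs_sketch(
    udf: Callable[[List[float]], float],
) -> Callable[[Iterable[Group]], Iterator[float]]:
    """Hypothetical analogue of the worker-side wiring: all processing lives here."""

    def process(groups: Iterable[Group]) -> Iterator[float]:
        for batches in groups:
            # Flatten the group's batches into one column of values,
            # then apply the aggregation once per group.
            values = [value for batch in batches for (_, value) in batch]
            yield udf(values)

    return process


serializer = PureIOSerializer()
process = read_udfs_sketch(sum)
stream = [
    [[("a", 1.0), ("a", 2.0)], [("a", 3.0)]],  # group "a" split across two batches
    [[("b", 10.0)]],                           # group "b" in a single batch
]
out = serializer.dump_stream(process(serializer.load_stream(stream)))
print(out)  # [6.0, 10.0]
```

Under these assumptions, swapping the aggregation (for example, a mean instead of `sum`) only touches the function passed to `read_udfs_sketch`; the serializer stays the same.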

Why are the changes needed?

Part of SPARK-55388.

Does this PR introduce any user-facing change?

No.

How was this patch tested?

Existing tests.

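ASV micro-benchmarks with repeat=(3, 5, 5.0) show improvements of up to 30% in most scenarios, with no regressions; the few_groups_lg scenarios are roughly unchanged (0% to -2%):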

SQL_GROUPED_AGG_ARROW_UDF (time)

| Scenario | UDF | Before | After | Change |
|---|---|---|---|---|
| few_groups_sm | sum | 8.32ms | 7.28ms | -12% |
| few_groups_sm | mean_multi | 7.38ms | 6.01ms | -19% |
| few_groups_lg | sum | 29.7ms | 29.7ms | 0% |
| few_groups_lg | mean_multi | 30.4ms | 29.9ms | -2% |
| many_groups_sm | sum | 231ms | 181ms | -22% |
| many_groups_sm | mean_multi | 206ms | 144ms | -30% |
| many_groups_lg | sum | 109ms | 95.9ms | -12% |
| many_groups_lg | mean_multi | 107ms | 85.0ms | -21% |
| wide_cols | sum | 75.9ms | 59.9ms | -21% |
| wide_cols | mean_multi | 74.7ms | 57.1ms | -24% |

SQL_GROUPED_AGG_ARROW_ITER_UDF (time)

| Scenario | UDF | Before | After | Change |
|---|---|---|---|---|
| few_groups_sm | sum | 6.41ms | 5.66ms | -12% |
| few_groups_sm | mean_multi | 5.37ms | 4.63ms | -14% |
| few_groups_lg | sum | 18.2ms | 17.9ms | -2% |
| few_groups_lg | mean_multi | 20.7ms | 20.3ms | -2% |
| many_groups_sm | sum | 190ms | 166ms | -13% |
| many_groups_sm | mean_multi | 156ms | 131ms | -16% |
| many_groups_lg | sum | 78.4ms | 72.8ms | -7% |
| many_groups_lg | mean_multi | 71.1ms | 65.5ms | -8% |
| wide_cols | sum | 49.1ms | 44.0ms | -10% |
| wide_cols | mean_multi | 45.7ms | 39.8ms | -13% |
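The Change columns in the tables above appear to be the relative difference between the After and Before timings, rounded to a whole percent. A small sketch of that arithmetic, assuming this is how the figures were derived (`percent_change` is a hypothetical helper, not part of the PR):

```python
def percent_change(before_ms: float, after_ms: float) -> int:
    """Relative change from before to after, rounded to a whole percent."""
    return round((after_ms - before_ms) / before_ms * 100)


# A few rows taken from the tables above: (scenario, UDF, before, after).
rows = [
    ("many_groups_sm", "sum", 231.0, 181.0),
    ("many_groups_sm", "mean_multi", 206.0, 144.0),
    ("many_groups_lg", "sum", 78.4, 72.8),
    ("few_groups_lg", "sum", 29.7, 29.7),
]

for scenario, udf, before, after in rows:
    print(f"{scenario} {udf}: {percent_change(before, after)}%")
```

Negative values mean the run after the refactor was faster.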

Peak memory: No change (~1.15G for all scenarios).
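For readers unfamiliar with ASV, the `repeat=(3, 5, 5.0)` setting mentioned above is a benchmark attribute of the form `(min_repeat, max_repeat, max_time)`: collect at least 3 samples, and stop at 5 samples or once 5.0 seconds of collection time is exceeded. The sketch below shows what such a benchmark class could look like. It is not the benchmark used in this PR; the class name, parameters, and workload are assumptions chosen for illustration, and the workload is plain Python rather than a Spark UDF.

```python
class GroupedAggSuite:
    """Hypothetical ASV benchmark; ASV discovers methods prefixed with time_."""

    # (min_repeat, max_repeat, max_time in seconds)
    repeat = (3, 5, 5.0)

    # ASV runs the benchmark once per parameter value.
    params = [10, 1000]
    param_names = ["num_groups"]

    def setup(self, num_groups):
        # Build synthetic groups of 100 values each; setup time is not measured.
        self.groups = [
            [float(i + j) for j in range(100)] for i in range(num_groups)
        ]

    def time_sum(self, num_groups):
        # The timed body: one aggregation per group.
        for values in self.groups:
            sum(values)
```

ASV benchmark classes are ordinary Python classes, so the suite can also be exercised by hand (`s = GroupedAggSuite(); s.setup(10); s.time_sum(10)`) to check that it runs before handing it to `asv run`.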

Was this patch authored or co-authored using generative AI tooling?

No.

@Yicong-Huang changed the title from "[SPARK-56123][PYTHON] Refactor SQL_GROUPED_AGG_ARROW_UDF and SQL_GROUPED_AGG_ARROW_ITER_UDF to use ArrowStreamSerializer" to "[SPARK-56123][PYTHON] Refactor SQL_GROUPED_AGG_ARROW_UDF and SQL_GROUPED_AGG_ARROW_ITER_UDF" on Mar 24, 2026
@Yicong-Huang marked this pull request as draft on March 24, 2026 23:21
@Yicong-Huang force-pushed the SPARK-56123/refactor/grouped-agg-arrow-udf branch 5 times, most recently from a15c5d1 to 827d633, on March 25, 2026 17:24
@Yicong-Huang (Contributor, Author) commented:

This PR depends on #54967 (enforce_schema). Will rebase after that one merges.

@Yicong-Huang force-pushed the SPARK-56123/refactor/grouped-agg-arrow-udf branch from 827d633 to 1119500 on March 26, 2026 06:35
@Yicong-Huang force-pushed the SPARK-56123/refactor/grouped-agg-arrow-udf branch from d7c3fac to e1f6f32 on March 26, 2026 07:17
@Yicong-Huang marked this pull request as ready for review on March 26, 2026 17:05
@zhengruifeng (Contributor) commented:

merged to master
