Antalya 26.3: Parallelize reads from a single Parquet file in StorageFile by zvonand · Pull Request #1806 · Altinity/ClickHouse

zvonand · 2026-05-15T19:58:42Z

Changelog category (leave one):

Performance Improvement

Changelog entry (a user-readable short description of the changes that goes to CHANGELOG.md):

Reading a single large local Parquet file via file() / File engine is now parallelised across multiple sources, each handling a subset of row groups. This eliminates a Resize 1 → N bottleneck in the pipeline and brings single-file ClickBench performance close to the partitioned variant — Q23 goes from ~1.4s to ~0.55s, Q22 from ~0.9s to ~0.48s, Q27 from ~1.6s to ~0.54s on 96 vCPUs (ClickHouse#104251 by @alexey-milovidov).

Cherry-picked from ClickHouse#104251.

On ClickBench, single-file Parquet runs are 3–9× slower than the 100-file partitioned runs on the same data (e.g. on c7a.metal-48xl, Q23 is 8.90s vs 0.99s, Q22 1.82s vs 0.41s, Q27 1.21s vs 0.45s). The cause is in StorageFile: when reading a single splittable file it creates exactly one ParquetV3BlockInputFormat source, so the pipeline becomes File 0 → 1 followed by Resize 1 → 96. That fan-out is a serialization point — every chunk has to leave the single source through one read before any of the 96 aggregators can touch it, so most cores sit idle.

The bucket-splitting machinery (ParquetBucketSplitter, setBucketsToRead, FileBucketInfo) already existed for cluster mode but was never wired into StorageFile. This PR wires it in:

- New IBucketSplitter::splitToBucketsByCount returning roughly N contiguous row-group ranges; Parquet implements it.
- New FormatFactory::checkFormatHasSplitter so callers can probe without throwing.
- StorageFile::ReadFromFile::initializePipeline, when reading exactly one local splittable file, asks the splitter for max_num_streams buckets and creates one StorageFileSource per bucket. Each source carries fixed_file_path + file_bucket_info and skips the shared FilesIterator.
- ParquetV3BlockInputFormat::read honours buckets_to_read in the trivial-count path so each bucket only reports its own row count.
- The count cache (keyed by file path) is bypassed for bucketed reads; otherwise every bucket would report the file's total and counts would be multiplied by the number of buckets.
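
As a rough illustration of what "roughly N contiguous row-group ranges" can mean, here is a minimal sketch. It is not the code from this PR: the function name `splitRowGroupsByCount`, its signature, and the even-split policy are assumptions made for the example, and the real `IBucketSplitter` interface may look different.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical helper, NOT the real IBucketSplitter API.
// Splits row groups 0..num_row_groups into at most requested_buckets
// contiguous half-open ranges [begin, end) whose sizes differ by at most one.
std::vector<std::pair<std::size_t, std::size_t>>
splitRowGroupsByCount(std::size_t num_row_groups, std::size_t requested_buckets)
{
    std::vector<std::pair<std::size_t, std::size_t>> ranges;
    if (num_row_groups == 0 || requested_buckets == 0)
        return ranges;

    // Never create more buckets than there are row groups.
    const std::size_t buckets = std::min(num_row_groups, requested_buckets);
    const std::size_t base = num_row_groups / buckets;
    const std::size_t extra = num_row_groups % buckets;

    std::size_t begin = 0;
    for (std::size_t i = 0; i < buckets; ++i)
    {
        // The first `extra` buckets take one additional row group.
        const std::size_t size = base + (i < extra ? 1 : 0);
        ranges.emplace_back(begin, begin + size);
        begin += size;
    }
    assert(begin == num_row_groups); // every row group is covered exactly once
    return ranges;
}
```

Under this assumed policy, a file with 226 row groups split for 96 streams yields 34 buckets of 3 row groups followed by 62 buckets of 2.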

Pipeline becomes File × N 0 → 1 straight into the aggregators, matching the partitioned variant.
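
The reason the count cache has to be bypassed for bucketed reads can be shown with a small model. This is only a sketch under assumptions: `RowGroupRange` and the three functions below are hypothetical names for the example, and the per-row-group counts stand in for what a reader would take from the Parquet footer.

```cpp
#include <cassert>
#include <cstddef>
#include <numeric>
#include <utility>
#include <vector>

// Hypothetical type for the example: half-open range [begin, end) of row groups.
using RowGroupRange = std::pair<std::size_t, std::size_t>;

// Rows a bucketed source should report: only those in its own row groups.
std::size_t countRowsInBucket(const std::vector<std::size_t> & rows_per_group, RowGroupRange bucket)
{
    assert(bucket.first <= bucket.second && bucket.second <= rows_per_group.size());
    const auto first = rows_per_group.begin() + static_cast<std::ptrdiff_t>(bucket.first);
    const auto last = rows_per_group.begin() + static_cast<std::ptrdiff_t>(bucket.second);
    return std::accumulate(first, last, std::size_t{0});
}

// Total seen by the query when every source reports its own bucket.
std::size_t totalFromBuckets(const std::vector<std::size_t> & rows_per_group,
                             const std::vector<RowGroupRange> & buckets)
{
    std::size_t total = 0;
    for (const auto & bucket : buckets)
        total += countRowsInBucket(rows_per_group, bucket);
    return total;
}

// Total seen if every source answered from a cache entry keyed by file path,
// i.e. each of the num_buckets sources reported the whole file's row count.
std::size_t totalIfEachBucketUsedFileCount(const std::vector<std::size_t> & rows_per_group,
                                           std::size_t num_buckets)
{
    const std::size_t file_total
        = std::accumulate(rows_per_group.begin(), rows_per_group.end(), std::size_t{0});
    return file_total * num_buckets;
}
```

With row groups of 100, 200, 300 and 400 rows split into two buckets, the per-bucket path sums to 1000, while the cached per-file path would report 2000.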

Results

96-vCPU box, hits.parquet (14 GiB, 226 row groups):

| Query | Single (master) | Single (this PR) | Partitioned |
|---|---|---|---|
| Q21 | 0.40–0.66s | 0.44s | 0.34s |
| Q22 | 0.93–1.36s | 0.48s | 0.41s |
| Q23 | 1.33–1.45s | 0.55s | 0.42s |
| Q26 | 0.50s | 0.35s | 0.19s |
| Q27 | 1.6s | 0.54s | 0.45s |

CPU utilisation on Q23 jumped from ~6× to ~18× of 96 cores. Aggregate results (count, sum(UserID), sum(length(URL)), Q21, Q23) match the partitioned variant exactly. The remaining ~1.3× gap to partitioned is per-source initialization overhead: each bucket source still reads the 14 GiB file's footer separately. Sharing parsed metadata for local files is the obvious next step, but it is a much bigger change.

Documentation entry for user-facing changes

Documentation is written (mandatory for new features)

…solution in next commit) --- Original cherry-pick message follows: Merge pull request ClickHouse#104251 from alexey-milovidov/parquet-single-file-parallelism Parallelize reads from a single Parquet file in StorageFile # Conflicts: # src/Processors/Formats/Impl/ParquetV3BlockInputFormat.cpp # src/Processors/Formats/Impl/ParquetV3BlockInputFormat.h

github-actions · 2026-05-15T20:03:18Z

Workflow [PR], commit [b08d6ae]

zvonand · 2026-05-17T17:49:31Z

RelEasy `analyze-fails` — UNRELATED

run completed at 2026-05-17T17:49:29Z

Head SHA: b08d6ae98f (feature/antalya-26.3/ClickHouse-ClickHouse-pr-104251)
Tests considered: 7 across 4 CI shard(s)
Outcomes: 0 done · 0 partial · 4 unrelated · 0 unresolved
Commits added by AI: 0 (—)
Anthropic cost: $2.3273

Per-shard outcomes

⏭️ UNRELATED — `Stateless tests (amd_tsan, s3 storage, sequential, 1/2)`

1 failed test(s) considered — cost $0.4303
full report

AI narration

The single failing test is 03760_backup_tar_archive. The failure annotation says it's failing on 4 other tracked PRs (strong prior for unrelated). The test creates 500 partitions then runs BACKUP/RESTORE with a tar disk and timed out at 300s on RESTORE. The PR's diff doesn't touch backup/restore plumbing (only an unrelated formatQueryImpl change in ASTBackupQuery.cpp for BACKUP FROM SNAPSHOT, which doesn't affect tar archive I/O or restore timing), nor the tar archive code, nor the test file itself. This is a known slow/flaky test unrelated to the cherry-pick.

No code changes are warranted.

Summary

03760_backup_tar_archive — [unrelated] — annotated as failing on 4 other tracked PRs (master-side flake); 300s timeout on RESTORE TABLE ... FROM Disk('backups', '03760_backup_tar_archive.tar'). The PR diff doesn't touch backup/restore execution, tar archive code, or this test. The only ASTBackupQuery.cpp change is a formatter branch for BACKUP FROM SNAPSHOT, which doesn't affect this test's query path or runtime.

UNRELATED

⏭️ UNRELATED — `Stateless tests (arm_binary, sequential)`

1 failed test(s) considered — cost $0.7426
full report

AI narration

Based on my analysis:

The failing test: 00157_cache_dictionary timed out at 300s during INSERT INTO test.hits_1m SELECT * FROM test.hits LIMIT 1000000.

The PR scope: Cherry-pick of upstream PR ClickHouse#104251, which parallelizes Parquet reads in StorageFile. The diff touches Parquet code, OAuth/JWT, and various unrelated areas — none directly related to MergeTree INSERT performance.

Why this is not caused by the PR:

The test was not modified by this PR.
The query that timed out is a simple INSERT INTO ... SELECT * FROM test.hits LIMIT 1000000 with max_threads=1, max_insert_threads=1 — pure MergeTree insert path, no Parquet, no StorageFile, no MVs.
The test is tagged Heavy, no-tsan, no-msan, no-asan, no-parallel and has a long history of slowness commits (Disable for asan too (too slow), Exclude 00157_cache_dictionary from msan, Relax tests a little, set index granularity).
Relevant changes in the PR that could touch the INSERT path are inert here:
- InterpreterInsertQuery.cpp:770 — the new branch isViewsInvolved() ? max_threads : max_insert_threads. The test has no MVs and sets both max_threads=1 and max_insert_threads=1, so the result of setNumThreads is 1 either way.
- MergedData.cpp changes touch merge algorithms with ColumnReplicated paths — only relevant for merges with JOIN-produced replicated columns, which this insert doesn't have.
- MergeTask.cpp only changes one mutation snapshot param computation per merge — negligible overhead, and irrelevant to the INSERT itself.
The shard is arm_binary, sequential (non-sanitizer), suggesting a CI machine variance / load fluctuation pushed this known-slow test over its 300s budget.

No code change is justified. I have not written a single sentence of "this PR broke this test because " — because I can't, because it didn't.

Final summary:

00157_cache_dictionary — [unrelated] — Heavy stateful insert test (INSERT 1M rows FROM test.hits) that timed out at 300s. The query path (MergeTree INSERT with max_threads=1, max_insert_threads=1, no MVs) is not affected by this PR's diff (Parquet StorageFile, OAuth/JWT, etc.). The test has a documented history of slowness across builds (sanitizer exclusions, no-parallel: Heavy tag, repeated relaxation commits) and a tight, hard-coded 300s max_execution_time budget. Likely CI machine variance on the ARM runner.

UNRELATED

⏭️ UNRELATED — `Integration tests (amd_msan, 2/6)`

1 failed test(s) considered — cost $0.4483
full report

AI narration

Looking at the failing test against the PR's diff:

Test: test_cpu_time_fairness[fixed_longer_prd] — timed out (>900s) in thread.join() at line 333, called from development.stop() at line 426.

Annotation: flaky-elsewhere on 2 other tracked PRs.

PR's changes to this area:

CPULeaseAllocation.cpp/.h: bug fix to capture stable ThreadGroup counters before wait_timer is constructed (avoids UAF on thread-local profile events).
test_scheduler_cpu_preemptive/test.py: only adds a ConcurrencyControlWaitMicroseconds > 0 assertion inside test_independent_pools — not inside the failing test_cpu_time_fairness.

Verdict: The failing test was not modified by this PR. Its failure mode is a pytest-timeout while joining the workload threads after stop_event is set. Under MSan, numbers_mt(4_000_000_000) queries run very slowly, and each worker thread only checks stop_event after the in-flight query finishes — so a query that was just dispatched at the moment of stop() blocks the join for a long time. The PR's CPULeaseAllocation fix is a UAF guard on the wait timer counters and does not plausibly hang the scheduler.

The flaky-elsewhere signal (2 other unrelated PRs) plus the timeout-during-join symptom under MSan match a pre-existing slow-environment flake, not a regression from this diff. No code changes are warranted.

git status --porcelain

Final summary

test_scheduler_cpu_preemptive/test.py::test_cpu_time_fairness[fixed_longer_prd] — [unrelated] — also failing on 2 other tracked PRs (master flake). The test was not modified by this PR; the PR only adds an assertion inside test_independent_pools, a different test. Failure mode is a 900s pytest-timeout in thread.join waiting for numbers_mt(4_000_000_000) queries to drain — consistent with slow MSan runtime, not with the PR's CPULeaseAllocation UAF fix (which would manifest as a sanitizer report or exception, not a hang).

UNRELATED

⏭️ UNRELATED — `Integration tests (amd_tsan, 2/6)`

4 failed test(s) considered — cost $0.7061
full report

AI narration

The actual PR diff (since the last merge) only touches Parquet/StorageFile parallelization code — none of it has any connection to the Access, TokenAccessStorage, ExternalAuthenticators, JWT, or HTTP authentication code paths that the test_token_roles_mapping tests exercise.

Let me verify the test history one more layer to be sure this PR isn't somehow at fault.
Confirmed: this PR's diff (since the last merge into the branch) contains zero changes to src/Access/, src/Server/HTTP/, or the test directory. The PR only modifies Parquet single-file parallelism (FormatFactory, ParquetBlockInputFormat, ParquetV3BlockInputFormat, Parquet/ReadManager, StorageFile, IInputFormat).

The 4 failing tests exercise JWT token authentication and <roles_mapping> in TokenAccessStorage — code paths this PR doesn't touch. Combined with the flaky-elsewhere annotation (failing on 5 other tracked PRs), this is conclusively a master/branch-side issue, not caused by this PR.

No code changes made.

Final summary

test_token_roles_mapping/test.py::test_mapped_guid_grants_mapped_role — [unrelated] — this PR only touches Parquet single-file parallelism (cherry-pick of upstream Parallelize reads from a single Parquet file in StorageFile ClickHouse/ClickHouse#104251); failure is in JWT/token auth code which this PR doesn't touch. Also flaky on 5 other tracked PRs.
test_token_roles_mapping/test.py::test_multiple_guids_grant_multiple_roles — [unrelated] — same reason as above; tests <roles_mapping> in TokenAccessStorage which this PR doesn't modify. Flaky on 5 other tracked PRs.
test_token_roles_mapping/test.py::test_unmapped_guid_is_dropped_by_filter — [unrelated] — same reason; tests roles_filter interaction with JWT processor. Flaky on 5 other tracked PRs.
test_token_roles_mapping/test.py::test_only_unmapped_guids_yield_no_roles — [unrelated] — same reason; tests JWT auth user auto-creation. Flaky on 5 other tracked PRs.

UNRELATED

🤖 Posted automatically by releasy analyze-fails. Re-run the command to refresh.

alexey-milovidov and others added 2 commits May 15, 2026 21:49

Resolve conflicts in cherry-pick of ClickHouse#104251

b08d6ae

zvonand added the releasy (Created/managed by RelEasy) and ai-resolved (Port conflict auto-resolved by Claude) labels May 15, 2026

zvonand added the antalya-26.3 label May 15, 2026

svb-alt added the backport label May 16, 2026

zvonand added the verified (Approved for release) label May 18, 2026

zvonand merged commit ab23373 into antalya-26.3 May 18, 2026
547 of 581 checks passed

svb-alt added the antalya label May 19, 2026

zvonand added the port-antalya (PRs to be ported to all new Antalya releases) label May 19, 2026
