Antalya 26.3: Share S3 client cache per bucket by zvonand · Pull Request #1807 · Altinity/ClickHouse

zvonand · 2026-05-15T20:04:24Z

Changelog category (leave one):

Performance Improvement

Changelog entry (a user-readable short description of the changes that goes to CHANGELOG.md):

Share S3 ClientCache per bucket, reduce repeated region discovery (ClickHouse#96802 by @zvonand).

Cherry-picked from ClickHouse#96802.

Closes ClickHouse#92482
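
The idea in the changelog entry can be illustrated with a minimal sketch. It assumes a process-wide registry that hands out one shared cache object per bucket, so that several clients pointing at the same bucket reuse a region that was already discovered. The names `SharedBucketCache`, `BucketCacheRegistry`, and `resolveRegion` are hypothetical and are not taken from the ClickHouse sources; the real change lives in src/IO/S3/Client.cpp and may be structured differently.

```cpp
#include <cassert>
#include <memory>
#include <mutex>
#include <string>
#include <unordered_map>

// Hypothetical stand-in for the state a client caches about a bucket.
struct SharedBucketCache
{
    std::mutex mutex;
    std::unordered_map<std::string, std::string> region_for_bucket;
};

// Hypothetical registry: returns the same cache object for the same bucket.
class BucketCacheRegistry
{
public:
    std::shared_ptr<SharedBucketCache> getCacheForBucket(const std::string & bucket)
    {
        assert(!bucket.empty());
        std::lock_guard<std::mutex> lock(mutex);
        auto & cache = caches[bucket];
        if (!cache)
            cache = std::make_shared<SharedBucketCache>();
        return cache;
    }

private:
    std::mutex mutex;
    std::unordered_map<std::string, std::shared_ptr<SharedBucketCache>> caches;
};

// Looks the region up in the shared cache and calls `discover` only on a miss.
// `discover` is an assumed callback standing in for the real region lookup.
template <typename Discover>
std::string resolveRegion(SharedBucketCache & cache, const std::string & bucket, Discover && discover)
{
    std::lock_guard<std::mutex> lock(cache.mutex);
    auto it = cache.region_for_bucket.find(bucket);
    if (it != cache.region_for_bucket.end())
        return it->second;

    std::string region = discover(bucket);
    cache.region_for_bucket.emplace(bucket, region);
    return region;
}
```

In this sketch, two clients created for the same bucket receive the same `SharedBucketCache`, so the second call to `resolveRegion` returns the cached value without invoking `discover` again.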

Documentation entry for user-facing changes

Documentation is written (mandatory for new features)

github-actions · 2026-05-15T20:05:30Z

Workflow [PR], commit [22afc92]

zvonand · 2026-05-17T17:58:40Z

RelEasy `analyze-fails` — UNRELATED

run completed at 2026-05-17T17:58:38Z

Head SHA: 22afc92304 (feature/antalya-26.3/ClickHouse-ClickHouse-pr-96802)
Tests considered: 12 across 2 CI shard(s)
Outcomes: 0 done · 0 partial · 2 unrelated · 0 unresolved
Commits added by AI: 0 (—)
Anthropic cost: $4.6859

Per-shard outcomes

⏭️ UNRELATED — `Stateless tests (arm_asan, targeted)`

1 failed test(s) considered — cost $1.1164

AI narration

Let me check what the failure looks like and verify the build setup.
The single failing test 01171_mv_select_insert_isolation_long failed with "not enough iterations of select_insert_action has been made" (10/15 in two of the eight parallel actions). It's a long-running stress test that:

Spawns 8 concurrent transactional workloads against local MergeTree tables and a materialized view
Requires each workload to complete ≥15 iterations within a 400-second window
Has historically been tuned multiple times for timing (its git history includes changes that adjust MIN_ITERATIONS, adjust settings, and add a minimum time) and is tagged `long`, `no-ordinary-database`, `no-encrypted-storage`, `no-msan` for that reason
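
The pass criterion described above is a throughput budget rather than a correctness check. As a rough illustration only (the real test is a shell script, and everything here except the action name `select_insert_action` and the threshold of 15 is hypothetical), the check amounts to:

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical record of how many iterations one workload finished
// before the wall-clock window closed.
struct WorkloadResult
{
    std::string action;
    int iterations;
};

// Returns the actions that finished fewer than `min_iterations` iterations.
// A non-empty result means "too slow", not "wrong data".
std::vector<std::string> actionsBelowBudget(const std::vector<WorkloadResult> & results, int min_iterations)
{
    assert(min_iterations > 0);
    std::vector<std::string> missed;
    for (const auto & result : results)
        if (result.iterations < min_iterations)
            missed.push_back(result.action);
    return missed;
}
```

With a workload that reached 10 iterations against a minimum of 15, the function reports that action as having missed the budget, which is the shape of the failure seen here.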

The actual delta this PR introduces (b38a2e29460 + 22afc923047) is a cherry-pick of ClickHouse#96802, which contains only S3 client / diskSettings changes. None of that touches MergeTree, transactions, or materialized views. The failure mode is a wall-clock budget miss in a known-flaky stress test on an arm_asan debug build, with no correctness assertion fired: only two of eight parallel workers fell short. I can't write a one-sentence "this PR broke this test because ..." justification, so per the scoping rule it's out of scope.

No code changes made.
Final summary:

01171_mv_select_insert_isolation_long: [unrelated]. Long-running stress test (8 parallel transactional workers, 400s wall-clock budget, MIN_ITERATIONS=15). Failed with "not enough iterations of select_insert_action has been made from 33 until 433", which is a timing budget miss, not a correctness failure. Test history shows repeated MIN_ITERATIONS/timing tuning, consistent with a chronic stress-test flake on slow shards (debug+ASan on ARM). This PR's actual delta is a cherry-pick of ClickHouse/ClickHouse#96802, "Share S3 client cache per bucket" (S3 client / diskSettings only), with no plausible link to local-MergeTree transactional MV throughput.

UNRELATED

⏭️ UNRELATED — `Integration tests (amd_tsan, 2/6)`

11 failed test(s) considered — cost $3.5695

AI narration

I have enough information to classify all failures. Let me summarize my findings.

Investigation summary

The PR (feature/antalya-26.3/ClickHouse-ClickHouse-pr-96802) is a cherry-pick of upstream PR ClickHouse#96802. The actual diff vs origin/antalya-26.3 (the proper base) is only 4 files, one under src/Disks/ and three under src/IO/S3/:

src/Disks/DiskObjectStorage/ObjectStorages/S3/diskSettings.cpp
src/IO/S3/Client.cpp
src/IO/S3/Client.h
src/IO/S3/tests/gtest_aws_s3_client.cpp

(The huge git diff antalya-26.3..HEAD shown by --stat is misleading because the local antalya-26.3 is stale — origin/antalya-26.3 already has the CPULeaseAllocation UAF fix, the scheduler refactors, etc. The merge-base of HEAD vs origin/antalya-26.3 is 59bfcc082dc; the PR only adds 2 commits on top of that.)

None of the 11 failing tests exercise S3 client code:

test_scheduler_cpu_preemptive/* — scheduler/CPU lease tests.
test_token_roles_mapping/* — HTTP token-based authentication tests.

Root causes:

test_create_workload_under_load hits a pre-existing TSan data race in SemaphoreConstraint::active() at src/Common/Scheduler/Nodes/SemaphoreConstraint.h:205 (race between removeChild on the WorkloadResMgr thread and finishRequest on a worker). That file is not in this PR's diff. The TSan abort kills the server, and the cluster fixture is module-scoped, so the other 6 scheduler tests then fail with Connection refused / Attempt to read after eof during the clear_workloads_and_resources teardown.
The 4 test_token_roles_mapping tests are annotated flaky-elsewhere on 5 other tracked PRs — strong master-flake/infra signal, and they are unrelated to S3 client code.

No code changes warranted. Since I am not editing any code, no rebuild or re-run is needed (the rules say only re-run CAUSED-BY-THIS-PR tests).

Final report

test_scheduler_cpu_preemptive/test.py::test_cpu_time_fairness[fixed_longer_dev] — [unrelated] collateral damage: server died earlier in the run from a pre-existing TSan race in SemaphoreConstraint::active (file untouched by this S3-only PR); subsequent queries get ATTEMPT_TO_READ_AFTER_EOF while the connection drops.
test_scheduler_cpu_preemptive/test.py::test_cpu_time_fairness[random_equal] — [unrelated] server already dead from the same TSan abort; Connection refused in teardown. Also flaky on another tracked PR.
test_scheduler_cpu_preemptive/test.py::test_cpu_time_fairness[random_longer_prd] — [unrelated] same dead-server cascade; flaky-elsewhere.
test_scheduler_cpu_preemptive/test.py::test_cpu_time_fairness[random_longer_dev] — [unrelated] same dead-server cascade; flaky-elsewhere.
test_scheduler_cpu_preemptive/test.py::test_downscaling[cpu-slot-preemption-timeout-1ms] — [unrelated] same dead-server cascade; flaky-elsewhere.
test_scheduler_cpu_preemptive/test.py::test_drop_workload_during_query — [unrelated] same dead-server cascade; flaky-elsewhere.
test_scheduler_cpu_preemptive/test.py::test_create_workload_under_load — [unrelated] originating TSan data race in DB::SemaphoreConstraint::active() between WorkloadResourceManager::Resource::deleteNode (writer) and CPULeaseAllocation::RequestChain::finish (reader). The race site is in src/Common/Scheduler/Nodes/SemaphoreConstraint.h, which this PR (S3 client only) does not touch. Also flaky on another tracked PR.
test_token_roles_mapping/test.py::test_mapped_guid_grants_mapped_role — [unrelated] HTTP/auth test, no causal connection to S3 client code; failing on 5 other tracked PRs (strong master/infra flake).
test_token_roles_mapping/test.py::test_multiple_guids_grant_multiple_roles — [unrelated] same; flaky on 5 other PRs.
test_token_roles_mapping/test.py::test_unmapped_guid_is_dropped_by_filter — [unrelated] same; flaky on 5 other PRs.
test_token_roles_mapping/test.py::test_only_unmapped_guids_yield_no_roles — [unrelated] same; flaky on 5 other PRs.

git status --porcelain

Tree clean. No commits authored or reverted.

UNRELATED

🤖 Posted automatically by releasy analyze-fails. Re-run the command to refresh.

zvonand and others added 2 commits May 15, 2026 21:59

Cherry-pick of ClickHouse#96802 with unresolved conflict markers (res…

b38a2e2

…olution in next commit) --- Original cherry-pick message follows: Merge d0085a8 into 12f07ad # Conflicts: # src/Disks/DiskObjectStorage/ObjectStorages/S3/diskSettings.cpp

Resolve conflicts in cherry-pick of ClickHouse#96802

22afc92

zvonand added the `releasy` (Created/managed by RelEasy) and `ai-resolved` (Port conflict auto-resolved by Claude) labels on May 15, 2026

zvonand added the `backport`, `port-antalya` (PRs to be ported to all new Antalya releases), and `antalya-26.3` labels on May 15, 2026

zvonand wants to merge 2 commits into antalya-26.3 from feature/antalya-26.3/ClickHouse-ClickHouse-pr-96802
