Antalya 26.3: Share S3 client cache per bucket#1807

Open
zvonand wants to merge 2 commits into antalya-26.3 from feature/antalya-26.3/ClickHouse-ClickHouse-pr-96802

zvonand (Collaborator) commented May 15, 2026

Changelog category (leave one):

  • Performance Improvement

Changelog entry (a user-readable short description of the changes that goes to CHANGELOG.md):

Share S3 ClientCache per bucket, reduce repeated region discovery (ClickHouse#96802 by @zvonand).

Cherry-picked from ClickHouse#96802.


Closes ClickHouse#92482
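
As a rough illustration of the idea in the changelog entry (one cache per bucket, so region discovery is not repeated by every client), here is a minimal Python sketch. It is not the ClickHouse implementation, which is C++ under src/IO/S3/; the class names, the `fake_discover` helper, the bucket name, and the region value are all hypothetical.

```python
import threading
from typing import Callable, Dict, List, Optional


class ClientCache:
    """Per-bucket state that is expensive to rediscover (here: only the region)."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._region: Optional[str] = None

    def region(self, discover: Callable[[], str]) -> str:
        # Run discovery at most once; later callers reuse the stored value.
        with self._lock:
            if self._region is None:
                self._region = discover()
            return self._region


class ClientCacheRegistry:
    """Hands out one shared ClientCache per bucket name (hypothetical design)."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._caches: Dict[str, ClientCache] = {}

    def get(self, bucket: str) -> ClientCache:
        with self._lock:
            cache = self._caches.get(bucket)
            if cache is None:
                cache = ClientCache()
                self._caches[bucket] = cache
            return cache


registry = ClientCacheRegistry()
discovery_calls: List[str] = []


def fake_discover() -> str:
    # Stand-in for the network round trip that asks which region owns the bucket.
    discovery_calls.append("logs")
    return "eu-west-1"


# Two independent "clients" for the same bucket end up sharing one cache.
first = registry.get("logs").region(fake_discover)
second = registry.get("logs").region(fake_discover)
print(first, second, len(discovery_calls))
```

In this sketch the second lookup never calls `fake_discover`, which is the effect the changelog entry describes as reducing repeated region discovery.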

Documentation entry for user-facing changes

  • Documentation is written (mandatory for new features)

zvonand and others added 2 commits May 15, 2026 21:59
…olution in next commit)

---
Original cherry-pick message follows:

Merge d0085a8 into 12f07ad

# Conflicts:
#	src/Disks/DiskObjectStorage/ObjectStorages/S3/diskSettings.cpp
zvonand added the releasy (Created/managed by RelEasy) and ai-resolved (Port conflict auto-resolved by Claude) labels May 15, 2026
@github-actions

Workflow [PR], commit [22afc92]

zvonand added the backport (Backport), port-antalya (PRs to be ported to all new Antalya releases), and antalya-26.3 labels May 15, 2026
zvonand (Collaborator, Author) commented May 17, 2026

RelEasy analyze-fails — UNRELATED

run completed at 2026-05-17T17:58:38Z

  • Head SHA: 22afc92304 (feature/antalya-26.3/ClickHouse-ClickHouse-pr-96802)
  • Tests considered: 12 across 2 CI shard(s)
  • Outcomes: 0 done · 0 partial · 2 unrelated · 0 unresolved
  • Commits added by AI: 0 (—)
  • Anthropic cost: $4.6859

Per-shard outcomes

⏭️ UNRELATED — Stateless tests (arm_asan, targeted)

1 failed test(s) considered — cost $1.1164
full report

AI narration

Let me check what the failure looks like and verify the build setup.
The single failing test 01171_mv_select_insert_isolation_long failed with "not enough iterations of select_insert_action has been made" (10/15 in two of the eight parallel actions). It's a long-running stress test that:

  • Spawns 8 concurrent transactional workloads against local MergeTree tables and a materialized view
  • Requires each workload to complete ≥15 iterations within a 400-second window
  • Has been retuned for timing several times (its git history includes adjust MIN_ITERATIONS, adjust settings, and add min time) and is tagged long, no-ordinary-database, no-encrypted-storage, no-msan for that reason
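
To make the failure mode concrete, the sketch below models the pass condition described in the list above: every worker must finish at least MIN_ITERATIONS within the wall-clock window. It uses a simulated clock rather than real sleeping, and the per-iteration costs are invented for illustration; only the 8 workers, the 400-second window, and the threshold of 15 come from the report.

```python
from typing import Dict, List

MIN_ITERATIONS = 15      # threshold quoted in the report
BUDGET_SECONDS = 400     # wall-clock window quoted in the report


def iterations_completed(budget_s: int, cost_per_iteration_s: int) -> int:
    """Count whole iterations that fit in the budget, using a simulated clock."""
    elapsed = 0
    done = 0
    while elapsed + cost_per_iteration_s <= budget_s:
        elapsed += cost_per_iteration_s
        done += 1
    return done


def workers_below_minimum(costs: List[int]) -> Dict[int, int]:
    """Map worker index -> iterations done, for workers that missed the minimum."""
    shortfalls: Dict[int, int] = {}
    for worker, cost in enumerate(costs):
        done = iterations_completed(BUDGET_SECONDS, cost)
        if done < MIN_ITERATIONS:
            shortfalls[worker] = done
    return shortfalls


# Hypothetical per-iteration costs in seconds: six workers keep pace, two are slow.
costs = [20] * 6 + [40] * 2
shortfalls = workers_below_minimum(costs)
print(shortfalls)  # {6: 10, 7: 10}
```

With these made-up costs, two of the eight workers reach only 10 of the required 15 iterations, the same shape as the reported failure, without any assertion about data correctness being involved.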

The actual delta this PR introduces (b38a2e29460 + 22afc923047) is a cherry-pick of ClickHouse#96802: purely S3 client / diskSettings changes. None of that touches MergeTree, transactions, or materialized views. The failure is a wall-clock budget miss in a known-flaky stress test on an arm_asan debug build, with no correctness assertion fired; only two of eight parallel workers fell short. I can't write a one-sentence "this PR broke this test because ..." justification, so per the scoping rule it's out of scope.

No code changes made.
Final summary:

  • 01171_mv_select_insert_isolation_long [unrelated]: long-running stress test (8 parallel transactional workers, 400s wall-clock budget, MIN_ITERATIONS=15). Failed with "not enough iterations of select_insert_action has been made from 33 until 433", a timing budget miss, not a correctness failure. Test history shows repeated MIN_ITERATIONS/timing tuning, confirming it's a chronic stress-test flake on slow shards (debug+ASan on ARM). This PR's actual delta is a cherry-pick of "Share S3 client cache per bucket" (ClickHouse/ClickHouse#96802; S3 client / diskSettings only), with no plausible link to local-MergeTree transactional MV throughput.

UNRELATED

⏭️ UNRELATED — Integration tests (amd_tsan, 2/6)

11 failed test(s) considered — cost $3.5695
full report

AI narration

I have enough information to classify all failures. Let me summarize my findings.

Investigation summary

The PR (feature/antalya-26.3/ClickHouse-ClickHouse-pr-96802) is a cherry-pick of upstream PR ClickHouse#96802. The actual diff vs origin/antalya-26.3 (the proper base) is only 4 files, three in src/IO/S3/ and one in src/Disks/DiskObjectStorage/ObjectStorages/S3/:

src/Disks/DiskObjectStorage/ObjectStorages/S3/diskSettings.cpp
src/IO/S3/Client.cpp
src/IO/S3/Client.h
src/IO/S3/tests/gtest_aws_s3_client.cpp

(The huge git diff antalya-26.3..HEAD shown by --stat is misleading because the local antalya-26.3 is stale — origin/antalya-26.3 already has the CPULeaseAllocation UAF fix, the scheduler refactors, etc. The merge-base of HEAD vs origin/antalya-26.3 is 59bfcc082dc; the PR only adds 2 commits on top of that.)

None of the 11 failing tests exercise S3 client code:

  • test_scheduler_cpu_preemptive/* — scheduler/CPU lease tests.
  • test_token_roles_mapping/* — HTTP token-based authentication tests.

Root causes:

  • test_create_workload_under_load hits a pre-existing TSan data race in SemaphoreConstraint::active() at src/Common/Scheduler/Nodes/SemaphoreConstraint.h:205 (race between removeChild on the WorkloadResMgr thread and finishRequest on a worker). That file is not in this PR's diff. The TSan abort kills the server, and the cluster fixture is module-scoped, so the other 6 scheduler tests then fail with Connection refused / Attempt to read after eof during the clear_workloads_and_resources teardown.
  • The 4 test_token_roles_mapping tests are annotated flaky-elsewhere on 5 other tracked PRs — strong master-flake/infra signal, and they are unrelated to S3 client code.

No code changes warranted. Since I am not editing any code, no rebuild or re-run is needed (the rules say only re-run CAUSED-BY-THIS-PR tests).

Final report

  • test_scheduler_cpu_preemptive/test.py::test_cpu_time_fairness[fixed_longer_dev] [unrelated] collateral damage: server died earlier in the run from a pre-existing TSan race in SemaphoreConstraint::active (file untouched by this S3-only PR); subsequent queries get ATTEMPT_TO_READ_AFTER_EOF while the connection drops.
  • test_scheduler_cpu_preemptive/test.py::test_cpu_time_fairness[random_equal] [unrelated] server already dead from the same TSan abort; Connection refused in teardown. Also flaky on another tracked PR.
  • test_scheduler_cpu_preemptive/test.py::test_cpu_time_fairness[random_longer_prd] [unrelated] same dead-server cascade; flaky-elsewhere.
  • test_scheduler_cpu_preemptive/test.py::test_cpu_time_fairness[random_longer_dev] [unrelated] same dead-server cascade; flaky-elsewhere.
  • test_scheduler_cpu_preemptive/test.py::test_downscaling[cpu-slot-preemption-timeout-1ms] [unrelated] same dead-server cascade; flaky-elsewhere.
  • test_scheduler_cpu_preemptive/test.py::test_drop_workload_during_query [unrelated] same dead-server cascade; flaky-elsewhere.
  • test_scheduler_cpu_preemptive/test.py::test_create_workload_under_load [unrelated] originating TSan data race in DB::SemaphoreConstraint::active() between WorkloadResourceManager::Resource::deleteNode (writer) and CPULeaseAllocation::RequestChain::finish (reader). The race site is in src/Common/Scheduler/Nodes/SemaphoreConstraint.h, which this PR (S3 client only) does not touch. Also flaky on another tracked PR.
  • test_token_roles_mapping/test.py::test_mapped_guid_grants_mapped_role [unrelated] HTTP/auth test, no causal connection to S3 client code; failing on 5 other tracked PRs (strong master/infra flake).
  • test_token_roles_mapping/test.py::test_multiple_guids_grant_multiple_roles [unrelated] same; flaky on 5 other PRs.
  • test_token_roles_mapping/test.py::test_unmapped_guid_is_dropped_by_filter [unrelated] same; flaky on 5 other PRs.
  • test_token_roles_mapping/test.py::test_only_unmapped_guids_yield_no_roles [unrelated] same; flaky on 5 other PRs.

git status --porcelain reports a clean tree. No commits authored or reverted.

UNRELATED


🤖 Posted automatically by releasy analyze-fails. Re-run the command to refresh.
