fix: deduplicate consumer groups to prevent scrape failure (#335) #336
Open
lu-you wants to merge 1 commit into
…data#335) During a coordinator migration (triggered when large consumer groups rebalance), franz-go's ListGroups fans out to all brokers via allBrokersShardedReq and merges responses by simple concatenation. Both the old and new coordinator may transiently return the same group, producing duplicate group IDs. These duplicates flow into DescribeGroups, causing group-level metrics (consumer_group_info, consumer_group_members) to be emitted more than once per scrape cycle. The Prometheus registry then rejects the entire scrape, exporting zero metrics. Fix: add a per-cycle seenGroups map in collectConsumerGroups so each group emits its metrics exactly once regardless of how many broker shard responses contain it. Fixes redpanda-data#335 Co-authored-by: lu-you <lu-you@users.noreply.github.com> Note: fix approach informed by AI analysis; logic reviewed and tested manually on Debian 11 against a 3-broker cluster with 38+ member consumer groups.
Contributor:
Saw this. It can be fixed upstream in twmb/franz-go#1316.
Problem
When a Kafka cluster has consumer groups with 38+ active members subscribed to 5+ topics, kminion fails to serve any metrics with:
The Prometheus registry rejects the entire scrape, so zero metrics are exported. Reproducible in both the adminApi and offsetsTopic scrape modes.

Fixes #335
Root Cause
listConsumerGroups calls ListGroups via franz-go, which uses listGroupsSharder → allBrokersShardedReq to fan the request out to all brokers simultaneously. franz-go merges the responses by simple concatenation (merged.Groups = append(merged.Groups, resp.Groups...)).

During a coordinator migration (common when large consumer groups rebalance), both the old and the new coordinator broker transiently include the same group in their ListGroups responses, producing duplicate group IDs in the merged list.

Those duplicate IDs flow into DescribeGroups, causing the same group to appear multiple times in the response. Without a guard, consumer_group_info and consumer_group_members are emitted more than once per scrape, and Prometheus rejects the entire scrape.

Fix
Add a per-cycle seenGroups map in collectConsumerGroups so each group emits its metrics exactly once, regardless of how many broker shard responses contain it.

Impact
- The seenGroups map is a no-op when no duplicates exist.
- collectConsumerGroupLags is a separate path and is not touched by this change.

Testing
Manually tested on Debian 11 against a 3-broker Kafka cluster with 38+ member consumer groups. The /metrics endpoint returns cleanly without duplicate errors.
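
For reviewers, the root cause and the guard described above can be illustrated with a small standalone sketch. It is not the code in this PR and does not call franz-go: the listedGroup type and the mergeByConcat and dedupeGroups helpers are hypothetical stand-ins, assuming only that shard responses are merged by append and that a per-cycle map keyed by group ID is enough to drop repeats.

```go
package main

import "fmt"

// listedGroup is a hypothetical stand-in for one entry of a ListGroups
// response; it is not a franz-go type.
type listedGroup struct {
	Group  string // consumer group ID
	Broker int32  // broker that reported the group
}

// mergeByConcat mimics the append-style merge described in the root cause:
// every shard response is concatenated, with no deduplication.
func mergeByConcat(shards ...[]listedGroup) []listedGroup {
	var merged []listedGroup
	for _, shard := range shards {
		merged = append(merged, shard...)
	}
	return merged
}

// dedupeGroups keeps the first occurrence of each group ID and preserves
// order. The seenGroups map lives only for one call, i.e. one scrape cycle.
func dedupeGroups(groups []listedGroup) []listedGroup {
	seenGroups := make(map[string]struct{}, len(groups))
	unique := make([]listedGroup, 0, len(groups))
	for _, g := range groups {
		if _, dup := seenGroups[g.Group]; dup {
			continue // already handled in this cycle
		}
		seenGroups[g.Group] = struct{}{}
		unique = append(unique, g)
	}
	return unique
}

func main() {
	// During a coordinator migration, both brokers report "orders".
	oldCoordinator := []listedGroup{
		{Group: "orders", Broker: 1},
		{Group: "billing", Broker: 1},
	}
	newCoordinator := []listedGroup{
		{Group: "orders", Broker: 2},
	}

	merged := mergeByConcat(oldCoordinator, newCoordinator)
	unique := dedupeGroups(merged)

	fmt.Println(len(merged), len(unique)) // prints "3 2"
}
```

When the input contains no duplicates, dedupeGroups returns the same groups in the same order, which matches the no-op behavior listed under Impact.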