fix: deduplicate consumer groups to prevent scrape failure (#335) #336

Open
lu-you wants to merge 1 commit into redpanda-data:master from lu-you:lu-you/fix-duplicate-consumer-group-metrics
@lu-you lu-you commented Apr 21, 2026

Problem

When a Kafka cluster has consumer groups with 38+ active members subscribed to 5+ topics, kminion fails to serve any metrics with:

An error has occurred while serving metrics:
collected metric "kminion_kafka_consumer_group_info" { label:{name:"group_id" value:"my-group"} ... }
was collected before with the same name and label values

The Prometheus registry rejects the entire scrape — zero metrics exported. Reproducible on both adminApi and offsetsTopic scrape modes.

Fixes #335

Root Cause

listConsumerGroups calls ListGroups via franz-go, which uses listGroupsSharder and allBrokersShardedReq to fan the request out to all brokers simultaneously. franz-go merges the responses by simple concatenation (merged.Groups = append(merged.Groups, resp.Groups...)).

During a coordinator migration (common when large consumer groups rebalance), both the old and new coordinator broker transiently include the same group in their ListGroups response, producing duplicate group IDs in the merged list.
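The duplication described above can be sketched in isolation. This is a minimal illustration, not franz-go's code: `listedGroup`, `mergeByConcat`, and `countByID` are hypothetical names, and the two broker responses are an assumed mid-migration scenario. Only the concatenating `append` mirrors the merge quoted in the description.

```go
package main

import "fmt"

// listedGroup is a hypothetical stand-in for one entry of a ListGroups response.
type listedGroup struct {
	Group string
}

// mergeByConcat mimics a merge that appends every broker's groups
// without checking for group IDs it has already seen.
func mergeByConcat(shards [][]listedGroup) []listedGroup {
	var merged []listedGroup
	for _, shard := range shards {
		merged = append(merged, shard...)
	}
	return merged
}

// countByID reports how many times each group ID occurs in the merged list.
func countByID(groups []listedGroup) map[string]int {
	counts := make(map[string]int)
	for _, g := range groups {
		counts[g.Group]++
	}
	return counts
}

func main() {
	// Assumed scenario: two brokers both list "my-group" during a migration.
	shards := [][]listedGroup{
		{{Group: "my-group"}, {Group: "other-group"}},
		{{Group: "my-group"}},
	}
	merged := mergeByConcat(shards)
	// Prints the merged length and the number of "my-group" entries: 3 2
	fmt.Println(len(merged), countByID(merged)["my-group"])
}
```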

Those duplicate IDs flow into DescribeGroups, causing the same group to appear multiple times in the response. Without a guard, consumer_group_info and consumer_group_members are emitted more than once per scrape, and Prometheus rejects the entire scrape.
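To make the failure mode concrete, here is a toy model of the all-or-nothing behavior observed at the /metrics endpoint. It is not the Prometheus client library: `sample`, `serve`, and `errDuplicate` are hypothetical names, and the model assumes a single label (`group_id`) per metric.

```go
package main

import (
	"errors"
	"fmt"
)

// sample is a hypothetical, simplified metric sample: a name plus one label value.
type sample struct {
	Name    string
	GroupID string
}

var errDuplicate = errors.New("was collected before with the same name and label values")

// serve models the behavior described above: if any name/label
// combination occurs twice, no samples are returned at all.
func serve(samples []sample) ([]sample, error) {
	seen := make(map[sample]struct{}, len(samples))
	for _, s := range samples {
		if _, dup := seen[s]; dup {
			return nil, fmt.Errorf("%s{group_id=%q}: %w", s.Name, s.GroupID, errDuplicate)
		}
		seen[s] = struct{}{}
	}
	return samples, nil
}

func main() {
	out, err := serve([]sample{
		{Name: "kminion_kafka_consumer_group_info", GroupID: "my-group"},
		{Name: "kminion_kafka_consumer_group_info", GroupID: "my-group"},
		{Name: "kminion_kafka_consumer_group_info", GroupID: "other-group"},
	})
	// One duplicate is enough to drop every sample, including "other-group".
	fmt.Println(len(out), err != nil) // prints: 0 true
}
```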

Fix

Add a per-cycle seenGroups map in collectConsumerGroups so each group emits its metrics exactly once, regardless of how many broker shard responses contain it.
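A minimal sketch of the guard, assuming a simplified response shape. `describedGroup` and `emitOncePerGroup` are hypothetical names used for illustration; the actual change lives in collectConsumerGroups and may be structured differently.

```go
package main

import "fmt"

// describedGroup is a hypothetical stand-in for one group in a DescribeGroups response.
type describedGroup struct {
	Group   string
	Members int
}

// emitOncePerGroup calls emit for the first occurrence of each group ID
// and skips later occurrences within the same scrape cycle.
func emitOncePerGroup(groups []describedGroup, emit func(describedGroup)) {
	seenGroups := make(map[string]struct{}, len(groups)) // fresh map for every cycle
	for _, g := range groups {
		if _, ok := seenGroups[g.Group]; ok {
			continue // already emitted for this group in this cycle
		}
		seenGroups[g.Group] = struct{}{}
		emit(g)
	}
}

func main() {
	groups := []describedGroup{
		{Group: "my-group", Members: 38},
		{Group: "other-group", Members: 2},
		{Group: "my-group", Members: 38}, // duplicate from a second broker shard
	}
	emitOncePerGroup(groups, func(g describedGroup) {
		fmt.Printf("consumer_group_info{group_id=%q} members=%d\n", g.Group, g.Members)
	})
}
```

Because the map is created inside the function, nothing carries over between scrape cycles, so a group skipped as a duplicate in one cycle is emitted normally in the next.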

Impact

  • No change in normal operation — the seenGroups map is a no-op when no duplicates exist
  • During coordinator migration (rare, transient): first shard response wins (ordered by broker node ID); data self-corrects on the next scrape cycle
  • Lag metrics unaffected: collectConsumerGroupLags is a separate path and is not touched by this change
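The "first shard response wins" behavior can be illustrated as follows. This is a sketch under assumptions: `shardGroup` and `firstWins` are hypothetical names, the group states are made-up example values, and the explicit sort stands in for whatever ordering the merged response already has.

```go
package main

import (
	"fmt"
	"sort"
)

// shardGroup is a hypothetical record: a group as reported by one broker.
type shardGroup struct {
	NodeID int32
	Group  string
	State  string
}

// firstWins sorts reports by broker node ID and keeps only the first
// report for each group ID, dropping later ones.
func firstWins(reports []shardGroup) []shardGroup {
	sorted := append([]shardGroup(nil), reports...) // copy so the input is not reordered
	sort.SliceStable(sorted, func(i, j int) bool { return sorted[i].NodeID < sorted[j].NodeID })

	seen := make(map[string]struct{}, len(sorted))
	var kept []shardGroup
	for _, r := range sorted {
		if _, ok := seen[r.Group]; ok {
			continue
		}
		seen[r.Group] = struct{}{}
		kept = append(kept, r)
	}
	return kept
}

func main() {
	kept := firstWins([]shardGroup{
		{NodeID: 2, Group: "my-group", State: "PreparingRebalance"},
		{NodeID: 1, Group: "my-group", State: "Stable"},
	})
	// The report from the lowest node ID is the one that survives.
	fmt.Println(len(kept), kept[0].NodeID, kept[0].State) // prints: 1 1 Stable
}
```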

Testing

Manually tested on Debian 11 against a 3-broker Kafka cluster with consumer groups of 38+ members. The /metrics endpoint now returns cleanly, without duplicate-metric errors.


Note: fix approach informed by AI analysis; logic reviewed and tested manually.

…data#335)

During a coordinator migration (triggered when large consumer groups
rebalance), franz-go's ListGroups fans out to all brokers via
allBrokersShardedReq and merges responses by simple concatenation.
Both the old and new coordinator may transiently return the same group,
producing duplicate group IDs. These duplicates flow into DescribeGroups,
causing group-level metrics (consumer_group_info, consumer_group_members)
to be emitted more than once per scrape cycle. The Prometheus registry
then rejects the entire scrape, exporting zero metrics.

Fix: add a per-cycle seenGroups map in collectConsumerGroups so each
group emits its metrics exactly once regardless of how many broker shard
responses contain it.

Fixes redpanda-data#335

Co-authored-by: lu-you <lu-you@users.noreply.github.com>
Note: fix approach informed by AI analysis; logic reviewed and tested manually on Debian 11 against a 3-broker cluster with 38+ member consumer groups.

twmb commented Apr 25, 2026

Contributor

Saw this; it can be fixed upstream in twmb/franz-go#1316.

Development

Successfully merging this pull request may close these issues.

Duplicate metric collection error causes complete scrape failure for consumer groups