fix: deduplicate consumer groups to prevent scrape failure (#335) by lu-you · Pull Request #336 · redpanda-data/kminion

lu-you · 2026-04-21T04:08:02Z

Problem

When a Kafka cluster has consumer groups with 38+ active members subscribed to 5+ topics, kminion fails to serve any metrics with:

An error has occurred while serving metrics:
collected metric "kminion_kafka_consumer_group_info" { label:{name:"group_id" value:"my-group"} ... }
was collected before with the same name and label values

The Prometheus registry rejects the entire scrape — zero metrics exported. Reproducible on both adminApi and offsetsTopic scrape modes.

Fixes #335

Root Cause

listConsumerGroups calls ListGroups via franz-go, which uses listGroupsSharder → allBrokersShardedReq to fan the request out to all brokers simultaneously. franz-go merges responses by simple concatenation (merged.Groups = append(merged.Groups, resp.Groups...)).

During a coordinator migration (common when large consumer groups rebalance), both the old and the new coordinator brokers can transiently include the same group in their ListGroups responses, producing duplicate group IDs in the merged list.

Those duplicate IDs flow into DescribeGroups, causing the same group to appear multiple times in the response. Without a guard, consumer_group_info and consumer_group_members are emitted more than once per scrape, and Prometheus rejects the entire scrape.
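The failure mode described above can be reproduced in isolation. The following is a minimal sketch, not franz-go's actual code: shardResponse, mergeByConcat, and duplicateGroups are hypothetical names introduced here for illustration, and group IDs are modeled as plain strings.

```go
package main

// shardResponse is a hypothetical stand-in for one broker's ListGroups reply.
type shardResponse struct {
	BrokerID int32
	Groups   []string
}

// mergeByConcat mimics a merge by simple concatenation: every shard's groups
// are appended to the result without checking for repeats.
func mergeByConcat(shards []shardResponse) []string {
	var merged []string
	for _, s := range shards {
		merged = append(merged, s.Groups...)
	}
	return merged
}

// duplicateGroups returns each group ID that occurs more than once in merged,
// reported once, in the order the second occurrence was encountered.
func duplicateGroups(merged []string) []string {
	counts := make(map[string]int, len(merged))
	var dups []string
	for _, g := range merged {
		counts[g]++
		if counts[g] == 2 {
			dups = append(dups, g)
		}
	}
	return dups
}
```

If two shards both list "my-group", as the old and new coordinator might during a migration, mergeByConcat returns that ID twice and duplicateGroups reports it.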

Fix

Add a per-cycle seenGroups map in collectConsumerGroups so each group emits its metrics exactly once, regardless of how many broker shard responses contain it.
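A minimal sketch of the guard, assuming a simplified group type: describedGroup and dedupeGroups are hypothetical names for illustration and do not reproduce the actual diff in collectConsumerGroups. Only the seenGroups name comes from the description above.

```go
package main

// describedGroup is a hypothetical, simplified DescribeGroups entry.
type describedGroup struct {
	GroupID string
	State   string
	Members int
}

// dedupeGroups keeps the first entry seen for each group ID and drops any
// later repeats, so each group yields at most one set of metrics per cycle.
func dedupeGroups(groups []describedGroup) []describedGroup {
	// seenGroups is created fresh on every call, i.e. once per scrape cycle.
	seenGroups := make(map[string]struct{}, len(groups))
	unique := make([]describedGroup, 0, len(groups))
	for _, g := range groups {
		if _, ok := seenGroups[g.GroupID]; ok {
			continue // already emitted for this cycle
		}
		seenGroups[g.GroupID] = struct{}{}
		unique = append(unique, g)
	}
	return unique
}
```

Because the map is rebuilt on every cycle, a group that is skipped as a duplicate in one scrape is evaluated again from scratch in the next one.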

Impact

No change in normal operation: when no duplicates exist, the seenGroups check never skips a group.
During coordinator migration (rare, transient): the first shard response wins (ordered by broker node ID), and the data self-corrects on the next scrape cycle.
Lag metrics are unaffected: collectConsumerGroupLags is a separate path that this change does not touch.

Testing

Manually tested on Debian 11 against a 3-broker Kafka cluster with 38+ member consumer groups. /metrics endpoint returns cleanly without duplicate errors.

Note: fix approach informed by AI analysis; logic reviewed and tested manually.

…data#335)

During a coordinator migration (triggered when large consumer groups rebalance), franz-go's ListGroups fans out to all brokers via allBrokersShardedReq and merges responses by simple concatenation. Both the old and new coordinator may transiently return the same group, producing duplicate group IDs.

These duplicates flow into DescribeGroups, causing group-level metrics (consumer_group_info, consumer_group_members) to be emitted more than once per scrape cycle. The Prometheus registry then rejects the entire scrape, exporting zero metrics.

Fix: add a per-cycle seenGroups map in collectConsumerGroups so each group emits its metrics exactly once regardless of how many broker shard responses contain it.

Fixes redpanda-data#335

Co-authored-by: lu-you <lu-you@users.noreply.github.com>

Note: fix approach informed by AI analysis; logic reviewed and tested manually on Debian 11 against a 3-broker cluster with 38+ member consumer groups.

twmb · 2026-04-25T21:10:14Z

Saw this; it can be fixed upstream in twmb/franz-go#1316.

lu-you wants to merge 1 commit into redpanda-data:master from lu-you:lu-you/fix-duplicate-consumer-group-metrics
