Skip to content

feat(dispatch): add alert markers per group#5047

Open
siavashs wants to merge 12 commits intoprometheus:mainfrom
siavashs:feat/group_markers
Open

feat(dispatch): add alert markers per group#5047
siavashs wants to merge 12 commits intoprometheus:mainfrom
siavashs:feat/group_markers

Conversation

@siavashs
Copy link
Copy Markdown
Contributor

@siavashs siavashs commented Feb 25, 2026

This change adds alert markers to the aggregation groups in dispatcher.
Alert markers replace the global marker and are used to track the state of alerts in each aggregation group.

This change touches many components of the alertmanager.
Per Group alert markers are passed to the notifiers and then inhibitor and silencer using context.

The API has no breaking changes:

  • /alerts uses a temporary marker to track the state of alerts
  • /alerts/groups returns the status from group markers

The metrics are also updated to use group markers.
The alertmanager_alerts metric is moved to dispatcher.
The alertmanager_marked_alerts metric is removed.
By default it behaves the same as before, by aggregating all alerts in the groups.
Enabling group-key-in-metrics flag will cause the metrics to be grouped by group_key.

Pull Request Checklist

Please check all the applicable boxes.

Which user-facing changes does this PR introduce?

[FEATURE] Introduce per aggregation group AlertMarkers and drop Global Alert Marker
[CHANGE] Add `group-key-in-metrics` feature flag
[CHANGE] Remove `alertmanager_marked_alerts`
[CHANGE] Remove the following from `types` package: `MemMarker`, `AlertState*`, `AlertStatus`
[CHANGE] Move `AlertMarker`, `GroupMarker` to `marker` package

Summary by CodeRabbit

  • New Features

    • Optional group-key labeling in dispatcher metrics via a new feature flag for richer alert-group metrics.
    • New dispatcher metrics and per-state alert counts to improve observability.
    • New runtime markers for per-alert and per-group state tracking, surfaced to runtime operations.
  • Bug Fixes

    • More accurate alert status resolution computed per-request, improving active/suppressed/unprocessed consistency.
    • More reliable silence and inhibit handling so mute/inhibition state is reflected correctly.
  • Refactor

    • Simplified configuration: alert-status callback is no longer required.

@siavashs siavashs changed the title feat(dispatch): add group markers feat(dispatch): add alert markers per group Feb 25, 2026
Comment thread dispatch/dispatch.go
@TheMeier
Copy link
Copy Markdown
Contributor

TheMeier commented Mar 1, 2026

Reviewed the new marker package standalone. LGTM

Comment thread marker/util.go Outdated
Comment thread marker/group.go Outdated
Comment thread marker/alert.go
}

// SetInhibited implements AlertMarker.
func (m *alertMarker) SetInhibited(fp model.Fingerprint, inhibitedBy []string) {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just out of curiosity the previous SetIhibited used
func (m *MemMarker) SetInhibited(alert model.Fingerprint, ids ...string) {
And we changed it to []string ... Is there a reason we changed?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just for consistency in the interface API for both SetInhibited and SetSilenced.

Comment thread types/types.go
delete(m.groups, routeID+groupKey)
}

func (m *MemMarker) registerMetrics(r prometheus.Registerer) {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should note in the release note that we removed this metric, as it would be too expensive to calculate it across all markers... ? Unless each time we change it a marker we were to add/remove the relevant number from the metric... But yeah, if not worth it/useful definitely we should say it's gone

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I put it in the PR description:

[CHANGE] Remove `alertmanager_marked_alerts`

Do we need to also update any MD files? cc @SoloJacobs

Comment thread marker/marker.go
Comment thread marker/alert.go Outdated
Comment thread dispatch/metric.go Outdated
Comment thread notify/mute.go
} else {
filtered = append(filtered, a)
}
// TODO(fabxc): increment muted alerts counter if muted.
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are this TODO and the one above still relevant?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think he wanted to add/change the metrics here.
I am not sure if we need it.
I can remove these but my changes don't touch these.

Comment thread marker/group.go Outdated
@siavashs siavashs force-pushed the feat/group_markers branch from 724c634 to 9a5c33e Compare March 11, 2026 10:54
@siavashs siavashs requested a review from ultrotter March 11, 2026 16:19
@siavashs
Copy link
Copy Markdown
Contributor Author

@coderabbitai review

@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented Mar 16, 2026

✅ Actions performed

Review triggered.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented Mar 16, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

  • ▶️ Resume reviews
  • 🔍 Trigger review
📝 Walkthrough

Walkthrough

Migrates from a global marker to per-aggregation-group markers, removes callback-based alert-status plumbing (Options.AlertStatusFunc), introduces a marker package with context propagation, switches code from types.Alert to alert.Alert, and extracts dispatcher metrics into a dedicated metrics implementation.

Changes

Cohort / File(s) Summary
API layer
api/api.go, api/v2/api.go, api/v2/api_test.go, api/v2/compat.go
Removed Options.AlertStatusFunc, stopped passing alert-status callbacks into APIv2, switched handlers/filters to compute status via a context-provided marker.AlertMarker, and converted APIs/tests to *alert.Alert.
Marker package (new)
marker/marker.go, marker/alert.go, marker/status.go, marker/group.go, marker/context.go, marker/alert_test.go, marker/group_test.go
Added AlertMarker and GroupMarker interfaces, in-memory NewAlertMarker/NewGroupMarker implementations, context helpers (WithContext/FromContext), state type, and unit tests.
Dispatcher & metrics
dispatch/dispatch.go, dispatch/metric.go, dispatch/dispatch_test.go, dispatch/dispatch_bench_test.go
Switched dispatcher to use marker.GroupMarker and per-aggregation AlertMarker; AlertGroup now carries AlertStatuses. Added DispatcherMetrics and alertStateCollector with optional group-key labeling (feature-flagged). Tests/bench updated.
Main wiring
cmd/alertmanager/main.go
Replaced global marker creation with marker.NewGroupMarker(), removed marker constructor args from silencer/inhibitor/mem wiring, and updated dispatcher metrics construction to accept feature flags.
Notification pipeline & stages
notify/notify.go, notify/mute.go, notify/mute_test.go
Propagated *alert.Alert throughout notification pipeline and stages; stages now accept/return *alert.Alert and use marker.GroupMarker/context for group mutes; time stages ensure marker cleared on early returns.
Silencer & Inhibitor
silence/silence.go, silence/silence_test.go, silence/silence_bench_test.go, inhibit/inhibit.go, inhibit/inhibit_test.go, inhibit/inhibit_bench_test.go
Removed stored marker from Silencer/Inhibitor constructors; Mutes now read AlertMarker from context and defer marker updates. Tests updated to inject markers via context.
In-memory alerts provider
provider/mem/mem.go, provider/mem/mem_test.go
Dropped marker field and constructor parameter from Alerts; stopped deleting per-alert markers during GC and removed alerts-by-state metric plumbing. Tests adjusted accordingly.
Types cleanup
types/types.go, types/types_test.go
Removed legacy marker aliases/interfaces and NewMarker/MemMarker implementations/tests from types package; kept deprecated alert aliases.
Feature flags
featurecontrol/featurecontrol.go
Added group-key-in-metrics feature flag and API EnableGroupKeyInMetrics() to control group-key labeling in dispatcher metrics.
Alert helpers
alert/state.go
Added AlertState.Compare(other AlertState) int to define ordering (suppressed > active > unprocessed).

Sequence Diagram

sequenceDiagram
    participant Client
    participant APIv2
    participant Dispatcher
    participant GroupMarker
    participant AlertMarker
    participant Context

    Client->>APIv2: GET /api/v2/alerts
    APIv2->>Dispatcher: request groups + alertFilter(ctx)
    Dispatcher->>GroupMarker: ensure group-level marker
    Dispatcher->>AlertMarker: create NewAlertMarker() per-aggregation-group
    Dispatcher->>Context: ctx' = marker.WithContext(ctx, alertMarker)
    Dispatcher->>APIv2: return AlertGroup(s) with attached ctx'
    APIv2->>AlertMarker: Status(fp) via marker.FromContext(ctx') for each alert
    AlertMarker-->>APIv2: alert.AlertStatus (state, silenced, inhibited)
    APIv2-->>Client: OpenAPI response enriched with per-alert statuses
Loading

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 35.56% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Title check ✅ Passed The title 'feat(dispatch): add alert markers per group' accurately summarizes the main architectural change—introducing per-aggregation-group alert markers in the dispatcher.
Description check ✅ Passed The PR description comprehensively covers the feature objectives, implementation details, API compatibility guarantees, and user-facing changes including new feature flags and removed types.
Linked Issues check ✅ Passed The PR fully addresses issue #4953 by replacing the global marker with per-aggregation-group markers throughout the codebase while maintaining API compatibility and adding metrics enhancements.
Out of Scope Changes check ✅ Passed All changes are directly related to the core objective of migrating from global to per-group markers, including supporting infrastructure (context propagation, marker types, metrics, imports).

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented Mar 30, 2026

✅ Actions performed

Review triggered.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 3

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
notify/notify.go (1)

37-43: ⚠️ Potential issue | 🟡 Minor

Remove unused types import from dispatch/dispatch_test.go.

The types import in notify/notify.go (line 43) is needed and used at line 887 ([]*types.Alert). However, the same import in dispatch/dispatch_test.go (line 36) is unused and should be removed.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@notify/notify.go` around lines 37 - 43, Remove the unused "types" import from
dispatch/dispatch_test.go: open the import block in dispatch/dispatch_test.go
and delete the "github.com/prometheus/alertmanager/types" entry (it's not
referenced in that test), then run `go vet`/`go test` to ensure no other usages;
do not change the existing usage of types in notify/notify.go (e.g., the
[]*types.Alert at line 887) which must remain.
🧹 Nitpick comments (3)
marker/alert.go (1)

38-48: Copy the reason slices at the marker boundary.

SetSilenced / SetInhibited keep the caller’s slice as-is, and Status returns the same backing arrays back out. That makes the marker’s internal state externally mutable if a caller reuses or edits either slice later. Clone on write or before returning so the marker fully owns its state.

Also applies to: 51-61, 79-85

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@marker/alert.go` around lines 38 - 48, SetSilenced, SetInhibited and Status
currently reuse caller slices and return the same backing arrays, exposing
internal state; fix by copying slices on write and before returning so the
marker owns its data. In alertMarker.SetSilenced and alertMarker.SetInhibited,
replace assignments to alertStatus.SilencedBy / InhibitedBy with a copied slice
(make + copy) of the incoming slice; in alertMarker.Status, when constructing
the returned Alert or alertStatus, clone any slices you return (SilencedBy,
InhibitedBy, and Reason slices if present) so callers never get pointers to the
marker’s internal backing arrays.
dispatch/metric.go (1)

35-79: registerLimitMetrics is ignored here.

NewDispatcherMetrics still registers alertmanager_dispatcher_aggregation_group_limit_reached_total unconditionally, so callers passing false no longer get the behavior the signature advertises. Either wire the flag back into registration or drop the parameter to avoid a misleading API.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@dispatch/metric.go` around lines 35 - 79, The function NewDispatcherMetrics
ignores the registerLimitMetrics parameter but still unconditionally
creates/registers the aggrGroupLimitReached counter; update NewDispatcherMetrics
so that when registerLimitMetrics is false it does not create or register
alertmanager_dispatcher_aggregation_group_limit_reached_total (e.g., set
aggrGroupLimitReached to nil or a no-op metric) and only call
promauto.With(r).NewCounter for aggrGroupLimitReached when registerLimitMetrics
is true; alternatively, if the flag is unnecessary, remove the
registerLimitMetrics parameter and its callers—make the change in
NewDispatcherMetrics and the DispatcherMetrics struct usage
(aggrGroupLimitReached) accordingly.
marker/group_test.go (1)

53-75: Consider adding a concurrent access test.

Given that GroupMarker uses sync.RWMutex for thread safety, a test exercising concurrent SetMuted/Muted/DeleteByGroupKey calls would help validate the locking behavior under contention.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@marker/group_test.go` around lines 53 - 75, Add a new test that creates a
GroupMarker via NewGroupMarker and runs concurrent goroutines calling SetMuted,
Muted, and DeleteByGroupKey to exercise the RWMutex under contention; use
sync.WaitGroup and multiple iterations with randomized sleeps to interleave
operations, assert no panics/races and that final state matches expected (e.g.,
set a key, concurrently read and delete it, then verify Muted returns false
after DeleteByGroupKey). Target the methods NewGroupMarker, SetMuted, Muted, and
DeleteByGroupKey in the test to validate thread-safety.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@dispatch/dispatch_test.go`:
- Line 37: The build is failing because the import
"github.com/prometheus/alertmanager/types" is unused in dispatch_test.go; remove
that import from the file's import block (or replace it with a blank identifier
only if intentionally needed) so the test compiles—locate the import declaration
referencing types in dispatch_test.go and delete it or remove any references to
types.* so the compiler no longer reports an unused import.

In `@dispatch/metric.go`:
- Around line 124-149: The seen map population in the loop over
d.routeGroupsSlice (aggrGroup -> ag.alerts.List()) uses first-write-wins; change
it so on each fingerprint fp encountered you compute the incoming state via
ag.marker.Status(fp).State and merge it with any existing seen[fp] using
explicit precedence suppressed > active > unprocessed (i.e., if existing is
lower precedence than incoming, replace it; otherwise keep existing). Keep the
rest of the counting logic over seen unchanged.

In `@notify/notify.go`:
- Line 887: The method RetryStage.exec currently accepts variadic alerts of type
*alert.Alert but declares it returns []*types.Alert and then returns the
incoming alerts directly, causing a type mismatch; fix by converting the
incoming []*alert.Alert to []*types.Alert before returning (or change the
parameter type to []*types.Alert if callers can be updated). Locate
RetryStage.exec and either (A) map each *alert.Alert to a new *types.Alert
(create a small converter inside exec or call an existing conversion function)
and return that slice, or (B) change the exec signature to accept alerts
...*types.Alert and update call sites accordingly so the returned value matches
the declared []*types.Alert.

---

Outside diff comments:
In `@notify/notify.go`:
- Around line 37-43: Remove the unused "types" import from
dispatch/dispatch_test.go: open the import block in dispatch/dispatch_test.go
and delete the "github.com/prometheus/alertmanager/types" entry (it's not
referenced in that test), then run `go vet`/`go test` to ensure no other usages;
do not change the existing usage of types in notify/notify.go (e.g., the
[]*types.Alert at line 887) which must remain.

---

Nitpick comments:
In `@dispatch/metric.go`:
- Around line 35-79: The function NewDispatcherMetrics ignores the
registerLimitMetrics parameter but still unconditionally creates/registers the
aggrGroupLimitReached counter; update NewDispatcherMetrics so that when
registerLimitMetrics is false it does not create or register
alertmanager_dispatcher_aggregation_group_limit_reached_total (e.g., set
aggrGroupLimitReached to nil or a no-op metric) and only call
promauto.With(r).NewCounter for aggrGroupLimitReached when registerLimitMetrics
is true; alternatively, if the flag is unnecessary, remove the
registerLimitMetrics parameter and its callers—make the change in
NewDispatcherMetrics and the DispatcherMetrics struct usage
(aggrGroupLimitReached) accordingly.

In `@marker/alert.go`:
- Around line 38-48: SetSilenced, SetInhibited and Status currently reuse caller
slices and return the same backing arrays, exposing internal state; fix by
copying slices on write and before returning so the marker owns its data. In
alertMarker.SetSilenced and alertMarker.SetInhibited, replace assignments to
alertStatus.SilencedBy / InhibitedBy with a copied slice (make + copy) of the
incoming slice; in alertMarker.Status, when constructing the returned Alert or
alertStatus, clone any slices you return (SilencedBy, InhibitedBy, and Reason
slices if present) so callers never get pointers to the marker’s internal
backing arrays.

In `@marker/group_test.go`:
- Around line 53-75: Add a new test that creates a GroupMarker via
NewGroupMarker and runs concurrent goroutines calling SetMuted, Muted, and
DeleteByGroupKey to exercise the RWMutex under contention; use sync.WaitGroup
and multiple iterations with randomized sleeps to interleave operations, assert
no panics/races and that final state matches expected (e.g., set a key,
concurrently read and delete it, then verify Muted returns false after
DeleteByGroupKey). Target the methods NewGroupMarker, SetMuted, Muted, and
DeleteByGroupKey in the test to validate thread-safety.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 62685eb0-6c40-469e-87cc-0d7fd5712f7b

📥 Commits

Reviewing files that changed from the base of the PR and between 55723d3 and 0db0801.

📒 Files selected for processing (30)
  • api/api.go
  • api/v2/api.go
  • api/v2/api_test.go
  • api/v2/compat.go
  • cmd/alertmanager/main.go
  • dispatch/dispatch.go
  • dispatch/dispatch_bench_test.go
  • dispatch/dispatch_test.go
  • dispatch/metric.go
  • featurecontrol/featurecontrol.go
  • inhibit/inhibit.go
  • inhibit/inhibit_bench_test.go
  • inhibit/inhibit_test.go
  • marker/alert.go
  • marker/alert_test.go
  • marker/group.go
  • marker/group_test.go
  • marker/marker.go
  • marker/status.go
  • marker/util.go
  • notify/mute.go
  • notify/mute_test.go
  • notify/notify.go
  • provider/mem/mem.go
  • provider/mem/mem_test.go
  • silence/silence.go
  • silence/silence_bench_test.go
  • silence/silence_test.go
  • types/types.go
  • types/types_test.go
💤 Files with no reviewable changes (4)
  • api/api.go
  • types/types_test.go
  • provider/mem/mem.go
  • types/types.go
✅ Files skipped from review due to trivial changes (3)
  • marker/status.go
  • featurecontrol/featurecontrol.go
  • api/v2/api_test.go
🚧 Files skipped from review as they are similar to previous changes (4)
  • inhibit/inhibit_bench_test.go
  • notify/mute_test.go
  • silence/silence_test.go
  • notify/mute.go

Comment thread dispatch/dispatch_test.go Outdated
Comment thread dispatch/metric.go
Comment thread notify/notify.go Outdated
Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
inhibit/inhibit_test.go (1)

312-356: ⚠️ Potential issue | 🟡 Minor

Test case comments are misleading and don't match expected values.

Several comments in this test don't align with the expected results. This appears to be copy-paste from TestInhibitRuleMatches without updating for the different matcher types (rule1 uses MatchNotEqual for t1, not MatchEqual).

Examples:

  • Line 313: Comment says "inhibited" but expected: false
  • Line 335: Comment says "not inhibited" but expected: true
  • Line 349: Comment says "not inhibited" but expected: true

The expected values themselves appear correct for the MatchNotEqual semantics, but the comments should be updated to avoid confusion for future maintainers.

📝 Suggested comment fixes
 		{
-			// Matches target filter of rule1, inhibited.
+			// Does NOT match target filter of rule1 (MatchNotEqual t1!=1), not inhibited.
 			target:   model.LabelSet{"t1": "1", "e": "1"},
 			expected: false,
 		},
 		{
-			// Doesn't match target filter, not inhibited.
+			// Matches target filter of rule1 (t1!=1, so t1=0 matches), inhibited.
 			target:   model.LabelSet{"t1": "0", "e": "1"},
 			expected: true,
 		},
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@inhibit/inhibit_test.go` around lines 312 - 356, Update the misleading inline
comments in the inhibit test cases to reflect the actual semantics of rule1
(which uses MatchNotEqual for label "t1") rather than MatchEqual; specifically,
review each test case referencing rule1 and rule2 (symbols: rule1, rule2,
MatchNotEqual) and change the comment text to match the existing expected
boolean values (keep expected values as-is), e.g. where a comment currently says
"inhibited" but expected:false change it to "not inhibited (because
MatchNotEqual on t1)" and where it says "not inhibited" but expected:true change
to "inhibited (because MatchNotEqual on t1 does not block)"; ensure all comments
describing source/target matching (mentions of sourceAlert1, sourceAlert2,
target filters) are consistent with MatchNotEqual semantics.
♻️ Duplicate comments (2)
dispatch/metric.go (1)

124-153: ⚠️ Potential issue | 🟠 Major

Merge duplicate fingerprint states with explicit precedence.

With per-group markers, the same fingerprint can be active in one aggregation group and suppressed in another. The current first-write-wins logic in seen (line 132-134) makes the metric non-deterministic depending on iteration order. Consider merging with explicit precedence (e.g., suppressed > active > unprocessed).

🛠️ Proposed fix with merge precedence
 // Deduplicate by fingerprint for backward compatibility.
 seen := map[model.Fingerprint]alert.AlertState{}
 for i := range d.routeGroupsSlice {
     d.routeGroupsSlice[i].groups.Range(func(_, el any) bool {
         ag := el.(*aggrGroup)
         for _, a := range ag.alerts.List() {
             fp := a.Fingerprint()
             if !a.Resolved() {
-                if _, ok := seen[fp]; !ok {
-                    seen[fp] = ag.marker.Status(fp).State
-                }
+                incoming := ag.marker.Status(fp).State
+                existing, ok := seen[fp]
+                if !ok || statePrecedence(incoming) > statePrecedence(existing) {
+                    seen[fp] = incoming
+                }
             }
         }
         return true
     })
 }

+// statePrecedence returns the precedence of the state for merging.
+// Higher precedence wins when the same fingerprint appears in multiple groups.
+func statePrecedence(s alert.AlertState) int {
+    switch s {
+    case alert.AlertStateSuppressed:
+        return 2
+    case alert.AlertStateActive:
+        return 1
+    default: // unprocessed
+        return 0
+    }
+}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@dispatch/metric.go` around lines 124 - 153, The deduplication currently uses
first-write-wins in the seen map (seen :=
map[model.Fingerprint]alert.AlertState{}) when iterating d.routeGroupsSlice and
aggrGroup alerts, causing non-deterministic metric values; change the logic
inside the loop that does fp := a.Fingerprint() and sets seen[fp] =
ag.marker.Status(fp).State to instead merge states by explicit precedence: treat
AlertStateSuppressed highest, AlertStateActive next, AlertStateUnprocessed
lowest, and when a fingerprint already exists in seen compare the existing state
and the new ag.marker.Status(fp).State and replace it only if the new state has
higher precedence so the final seen map deterministically reflects suppressed >
active > unprocessed before the counting/emission.
notify/notify.go (1)

887-887: ⚠️ Potential issue | 🔴 Critical

Critical: Return type mismatch will cause a compile error.

The exec method declares return type []*types.Alert but receives alerts ...*alert.Alert and returns alerts directly. This type mismatch won't compile.

🐛 Proposed fix
-func (r RetryStage) exec(ctx context.Context, l *slog.Logger, alerts ...*alert.Alert) (context.Context, []*types.Alert, error) {
+func (r RetryStage) exec(ctx context.Context, l *slog.Logger, alerts ...*alert.Alert) (context.Context, []*alert.Alert, error) {
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@notify/notify.go` at line 887, The exec method signature on RetryStage uses
parameter type alerts ...*alert.Alert but declares it returns []*types.Alert,
causing a type mismatch; fix by changing the parameter type to ...*types.Alert
in the RetryStage.exec signature (and update all callers to pass types.Alert
pointers), or alternatively, keep alerts as ...*alert.Alert and convert/marshal
each alert.Alert into a *types.Alert slice before returning; refer to the
function name RetryStage.exec and the symbols alerts, alert.Alert and
types.Alert when making the change.
🧹 Nitpick comments (1)
marker/group_test.go (1)

22-75: LGTM!

The tests provide good coverage of the GroupMarker functionality including isolation between different route/group combinations and clearing behavior.

Consider adding a concurrent access test in the future to validate thread safety under load, though this is not critical for the initial implementation.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@marker/group_test.go` around lines 22 - 75, Add a concurrent-access unit test
for GroupMarker to validate thread-safety: create a test (e.g.,
TestGroupMarker_ConcurrentAccess) that exercises GroupMarker.SetMuted,
GroupMarker.Muted and GroupMarker.DeleteByGroupKey from many goroutines using
sync.WaitGroup and randomized or patterned operations, wait for completion, and
assert final invariants (for example that Muted returns expected
timeIntervalNames or empty after deletes); ensure the test covers mixed
reads/writes across multiple route/group combinations (route1/route2,
group1/group2) to catch race conditions and run with -race in CI.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Outside diff comments:
In `@inhibit/inhibit_test.go`:
- Around line 312-356: Update the misleading inline comments in the inhibit test
cases to reflect the actual semantics of rule1 (which uses MatchNotEqual for
label "t1") rather than MatchEqual; specifically, review each test case
referencing rule1 and rule2 (symbols: rule1, rule2, MatchNotEqual) and change
the comment text to match the existing expected boolean values (keep expected
values as-is), e.g. where a comment currently says "inhibited" but
expected:false change it to "not inhibited (because MatchNotEqual on t1)" and
where it says "not inhibited" but expected:true change to "inhibited (because
MatchNotEqual on t1 does not block)"; ensure all comments describing
source/target matching (mentions of sourceAlert1, sourceAlert2, target filters)
are consistent with MatchNotEqual semantics.

---

Duplicate comments:
In `@dispatch/metric.go`:
- Around line 124-153: The deduplication currently uses first-write-wins in the
seen map (seen := map[model.Fingerprint]alert.AlertState{}) when iterating
d.routeGroupsSlice and aggrGroup alerts, causing non-deterministic metric
values; change the logic inside the loop that does fp := a.Fingerprint() and
sets seen[fp] = ag.marker.Status(fp).State to instead merge states by explicit
precedence: treat AlertStateSuppressed highest, AlertStateActive next,
AlertStateUnprocessed lowest, and when a fingerprint already exists in seen
compare the existing state and the new ag.marker.Status(fp).State and replace it
only if the new state has higher precedence so the final seen map
deterministically reflects suppressed > active > unprocessed before the
counting/emission.

In `@notify/notify.go`:
- Line 887: The exec method signature on RetryStage uses parameter type alerts
...*alert.Alert but declares it returns []*types.Alert, causing a type mismatch;
fix by changing the parameter type to ...*types.Alert in the RetryStage.exec
signature (and update all callers to pass types.Alert pointers), or
alternatively, keep alerts as ...*alert.Alert and convert/marshal each
alert.Alert into a *types.Alert slice before returning; refer to the function
name RetryStage.exec and the symbols alerts, alert.Alert and types.Alert when
making the change.

---

Nitpick comments:
In `@marker/group_test.go`:
- Around line 22-75: Add a concurrent-access unit test for GroupMarker to
validate thread-safety: create a test (e.g., TestGroupMarker_ConcurrentAccess)
that exercises GroupMarker.SetMuted, GroupMarker.Muted and
GroupMarker.DeleteByGroupKey from many goroutines using sync.WaitGroup and
randomized or patterned operations, wait for completion, and assert final
invariants (for example that Muted returns expected timeIntervalNames or empty
after deletes); ensure the test covers mixed reads/writes across multiple
route/group combinations (route1/route2, group1/group2) to catch race conditions
and run with -race in CI.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 14704549-fc4f-4c3d-99d7-c49241be9280

📥 Commits

Reviewing files that changed from the base of the PR and between 0db0801 and 35b9fb6.

📒 Files selected for processing (30)
  • api/api.go
  • api/v2/api.go
  • api/v2/api_test.go
  • api/v2/compat.go
  • cmd/alertmanager/main.go
  • dispatch/dispatch.go
  • dispatch/dispatch_bench_test.go
  • dispatch/dispatch_test.go
  • dispatch/metric.go
  • featurecontrol/featurecontrol.go
  • inhibit/inhibit.go
  • inhibit/inhibit_bench_test.go
  • inhibit/inhibit_test.go
  • marker/alert.go
  • marker/alert_test.go
  • marker/group.go
  • marker/group_test.go
  • marker/marker.go
  • marker/status.go
  • marker/util.go
  • notify/mute.go
  • notify/mute_test.go
  • notify/notify.go
  • provider/mem/mem.go
  • provider/mem/mem_test.go
  • silence/silence.go
  • silence/silence_bench_test.go
  • silence/silence_test.go
  • types/types.go
  • types/types_test.go
💤 Files with no reviewable changes (4)
  • types/types_test.go
  • types/types.go
  • provider/mem/mem.go
  • api/api.go
✅ Files skipped from review due to trivial changes (9)
  • marker/status.go
  • silence/silence_bench_test.go
  • marker/alert.go
  • inhibit/inhibit_bench_test.go
  • marker/alert_test.go
  • dispatch/dispatch_test.go
  • marker/marker.go
  • featurecontrol/featurecontrol.go
  • silence/silence.go
🚧 Files skipped from review as they are similar to previous changes (6)
  • cmd/alertmanager/main.go
  • api/v2/api_test.go
  • marker/util.go
  • silence/silence_test.go
  • notify/mute_test.go
  • notify/mute.go

Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

♻️ Duplicate comments (1)
dispatch/metric.go (1)

124-149: ⚠️ Potential issue | 🟠 Major

Deduped alert-state metrics are still order-dependent.

With per-group markers, the same fingerprint can legitimately have different states in different aggregation groups. First-write-wins on seen makes the default alertmanager_alerts result depend on sync.Map.Range order instead of a defined precedence.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@dispatch/metric.go` around lines 124 - 149, The dedupe uses first-write-wins
on seen which makes results order-dependent; change the assignment logic inside
the loop that populates seen (where seen is
map[model.Fingerprint]alert.AlertState and the code calls
ag.marker.Status(fp).State) to apply a deterministic precedence instead of
skipping when a key exists: compute the newState := ag.marker.Status(fp).State
and update seen[fp] only if newState has higher precedence than the current
seen[fp] (e.g., prefer alert.AlertStateActive over alert.AlertStateSuppressed
over others), or implement a small helper priority function to compare states
before assigning.
🧹 Nitpick comments (3)
cmd/alertmanager/main.go (1)

291-291: Minor: Variable shadows package name.

The local variable marker shadows the imported marker package. While this works correctly (the package is only used for the NewGroupMarker() call at initialization), consider renaming to groupMarker for clarity.

♻️ Suggested rename for clarity
-	marker := marker.NewGroupMarker()
+	groupMarker := marker.NewGroupMarker()

Then update references at lines 381, 492, 510 accordingly.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cmd/alertmanager/main.go` at line 291, The local variable named marker
shadows the imported marker package; rename the local variable returned by
marker.NewGroupMarker() to groupMarker (e.g., groupMarker :=
marker.NewGroupMarker()) and update all intra-file references that use that
local variable (formerly marker) to groupMarker so the package identifier marker
remains unshadowed—ensure you update every use site that currently refers to the
local variable (the subsequent references in this file).
dispatch/dispatch_test.go (1)

433-438: Assert the AlertStatuses contents, not just that the map exists.

This only verifies allocation. A regression that populates every entry with the wrong state would still pass, even though Groups() now exposes these per-alert statuses.

✅ Small test hardening
 	for _, ag := range alertGroups {
 		require.NotNil(t, ag.AlertStatuses)
+		for _, a := range ag.Alerts {
+			status, ok := ag.AlertStatuses[a.Fingerprint()]
+			require.True(t, ok)
+			require.Equal(t, alert.AlertStateActive, status.State)
+		}
 		ag.AlertStatuses = nil
 	}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@dispatch/dispatch_test.go` around lines 433 - 438, Test currently only checks
AlertStatuses is non-nil then clears it; instead assert the actual contents
before nil-ing so regressions in per-alert state are caught. In the loop over
alertGroups (variable alertGroups, field AlertStatuses) replace the single
require.NotNil with a check that AlertStatuses equals the expected map for that
group (use require.Equal or similar against your expected per-alert status map
for this test), then set ag.AlertStatuses = nil so the later struct comparison
still works; ensure you reference the same keys/values that Groups() should
expose.
dispatch/metric.go (1)

36-79: registerLimitMetrics is currently ignored.

aggrGroupLimitReached is always registered with promauto.With(r).NewCounter, so callers passing false still expose alertmanager_dispatcher_aggregation_group_limit_reached_total. Either drop the flag or make the false branch create an unregistered counter.

🛠️ One local way to honor the flag
-		aggrGroupLimitReached: promauto.With(r).NewCounter(
-			prometheus.CounterOpts{
-				Name: "alertmanager_dispatcher_aggregation_group_limit_reached_total",
-				Help: "Number of times when dispatcher failed to create new aggregation group due to limit.",
-			},
-		),
+		aggrGroupLimitReached: func() prometheus.Counter {
+			opts := prometheus.CounterOpts{
+				Name: "alertmanager_dispatcher_aggregation_group_limit_reached_total",
+				Help: "Number of times when dispatcher failed to create new aggregation group due to limit.",
+			}
+			if !registerLimitMetrics || r == nil {
+				return prometheus.NewCounter(opts)
+			}
+			return promauto.With(r).NewCounter(opts)
+		}(),
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@dispatch/metric.go` around lines 36 - 79, The registerLimitMetrics boolean is
ignored in NewDispatcherMetrics causing aggrGroupLimitReached to always be
registered; update NewDispatcherMetrics so that when registerLimitMetrics is
true you create and register the counter with promauto.With(r).NewCounter (as
now), and when false you create an unregistered counter using
prometheus.NewCounter(prometheus.CounterOpts{...}) (or a no-op implementation)
and do not register it with r; refer to the NewDispatcherMetrics function and
the aggrGroupLimitReached symbol to implement the conditional
creation/registration so callers passing false no longer expose
alertmanager_dispatcher_aggregation_group_limit_reached_total.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@dispatch/dispatch.go`:
- Around line 131-133: Normalize nil metrics when constructing the dispatcher:
if the incoming metrics or metrics.alertsCollector is nil, replace them with a
no-op implementation so subsequent unguarded calls on d.metrics (e.g., in
methods referencing d.metrics at the sites where you currently call
d.metrics.Alert.../d.metrics.*) are safe. Implement a simple no-op type that
satisfies the DispatcherMetrics interface (or the concrete shape returned by
NewDispatcherMetrics) with an alertsCollector containing a dispatcher.Store
method that is a no-op, then in the dispatcher constructor (where you currently
call metrics.alertsCollector.dispatcher.Store(disp)) set metrics =
NewNoopDispatcherMetrics() when metrics == nil or metrics.alertsCollector == nil
before storing the dispatcher.

---

Duplicate comments:
In `@dispatch/metric.go`:
- Around line 124-149: The dedupe uses first-write-wins on seen which makes
results order-dependent; change the assignment logic inside the loop that
populates seen (where seen is map[model.Fingerprint]alert.AlertState and the
code calls ag.marker.Status(fp).State) to apply a deterministic precedence
instead of skipping when a key exists: compute the newState :=
ag.marker.Status(fp).State and update seen[fp] only if newState has higher
precedence than the current seen[fp] (e.g., prefer alert.AlertStateActive over
alert.AlertStateSuppressed over others), or implement a small helper priority
function to compare states before assigning.

---

Nitpick comments:
In `@cmd/alertmanager/main.go`:
- Line 291: The local variable named marker shadows the imported marker package;
rename the local variable returned by marker.NewGroupMarker() to groupMarker
(e.g., groupMarker := marker.NewGroupMarker()) and update all intra-file
references that use that local variable (formerly marker) to groupMarker so the
package identifier marker remains unshadowed—ensure you update every use site
that currently refers to the local variable (the subsequent references in this
file).

In `@dispatch/dispatch_test.go`:
- Around line 433-438: Test currently only checks AlertStatuses is non-nil then
clears it; instead assert the actual contents before nil-ing so regressions in
per-alert state are caught. In the loop over alertGroups (variable alertGroups,
field AlertStatuses) replace the single require.NotNil with a check that
AlertStatuses equals the expected map for that group (use require.Equal or
similar against your expected per-alert status map for this test), then set
ag.AlertStatuses = nil so the later struct comparison still works; ensure you
reference the same keys/values that Groups() should expose.

In `@dispatch/metric.go`:
- Around line 36-79: The registerLimitMetrics boolean is ignored in
NewDispatcherMetrics causing aggrGroupLimitReached to always be registered;
update NewDispatcherMetrics so that when registerLimitMetrics is true you create
and register the counter with promauto.With(r).NewCounter (as now), and when
false you create an unregistered counter using
prometheus.NewCounter(prometheus.CounterOpts{...}) (or a no-op implementation)
and do not register it with r; refer to the NewDispatcherMetrics function and
the aggrGroupLimitReached symbol to implement the conditional
creation/registration so callers passing false no longer expose
alertmanager_dispatcher_aggregation_group_limit_reached_total.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 0d731d28-13a3-4d8a-929c-6c07a9cf7c9e

📥 Commits

Reviewing files that changed from the base of the PR and between 0db0801 and 35b9fb6.

📒 Files selected for processing (30)
  • api/api.go
  • api/v2/api.go
  • api/v2/api_test.go
  • api/v2/compat.go
  • cmd/alertmanager/main.go
  • dispatch/dispatch.go
  • dispatch/dispatch_bench_test.go
  • dispatch/dispatch_test.go
  • dispatch/metric.go
  • featurecontrol/featurecontrol.go
  • inhibit/inhibit.go
  • inhibit/inhibit_bench_test.go
  • inhibit/inhibit_test.go
  • marker/alert.go
  • marker/alert_test.go
  • marker/group.go
  • marker/group_test.go
  • marker/marker.go
  • marker/status.go
  • marker/util.go
  • notify/mute.go
  • notify/mute_test.go
  • notify/notify.go
  • provider/mem/mem.go
  • provider/mem/mem_test.go
  • silence/silence.go
  • silence/silence_bench_test.go
  • silence/silence_test.go
  • types/types.go
  • types/types_test.go
💤 Files with no reviewable changes (4)
  • api/api.go
  • types/types_test.go
  • provider/mem/mem.go
  • types/types.go
✅ Files skipped from review due to trivial changes (5)
  • marker/status.go
  • marker/group_test.go
  • marker/marker.go
  • notify/notify.go
  • marker/group.go
🚧 Files skipped from review as they are similar to previous changes (8)
  • api/v2/api_test.go
  • marker/alert.go
  • inhibit/inhibit_bench_test.go
  • silence/silence_test.go
  • featurecontrol/featurecontrol.go
  • provider/mem/mem_test.go
  • dispatch/dispatch_bench_test.go
  • silence/silence.go

Comment thread dispatch/dispatch.go
Copy link
Copy Markdown
Contributor

@Spaceman1701 Spaceman1701 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry for the long delay before a review. Overall this looks good to me, just a few minor comments.

Nothing major enough to block this change if someone else approves before I get a chance to re-review!

Comment thread dispatch/dispatch.go
Comment thread dispatch/metric.go
Comment thread marker/util.go Outdated
Comment thread marker/util.go Outdated
@siavashs siavashs requested a review from Spaceman1701 April 3, 2026 14:38
@siavashs
Copy link
Copy Markdown
Contributor Author

siavashs commented Apr 4, 2026

@coderabbitai review

@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented Apr 4, 2026

✅ Actions performed

Review triggered.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

♻️ Duplicate comments (1)
dispatch/dispatch.go (1)

131-133: ⚠️ Potential issue | 🔴 Critical

Unguarded d.metrics access will panic when metrics are disabled.

The nil-check here only guards the alertsCollector.dispatcher.Store call. However, d.metrics is used without nil checks at Lines 157, 264, 278, 451, and 485. When metrics is nil, those calls will cause a nil pointer dereference.

Either guard all metric call sites or normalize nil metrics to a no-op implementation in the constructor.

🛡️ Proposed fix: normalize nil metrics to no-op
 	disp.state.Store(DispatcherStateUnknown)
 	disp.loaded = make(chan struct{})
 	disp.ctx, disp.cancel = context.WithCancel(context.Background())
 
+	if metrics == nil {
+		metrics = NewNoopDispatcherMetrics()
+	}
+
 	if metrics != nil && metrics.alertsCollector != nil {
 		metrics.alertsCollector.dispatcher.Store(disp)
 	}

This requires implementing a NewNoopDispatcherMetrics() function that returns a *DispatcherMetrics with no-op gauge/counter/histogram implementations.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@dispatch/dispatch.go` around lines 131 - 133, The code currently only guards
metrics.alertsCollector.dispatcher.Store(disp) but leaves d.metrics potentially
nil elsewhere (d.metrics used at lines referenced), causing panics; fix by
normalizing nil metrics in the dispatcher constructor: implement
NewNoopDispatcherMetrics() that returns a *DispatcherMetrics whose
gauges/counters/histograms and alertsCollector are no-op implementations, and
replace any assignment that may leave d.metrics nil with a call to
NewNoopDispatcherMetrics() so d.metrics is never nil (alternatively ensure the
constructor always sets d.metrics to NewNoopDispatcherMetrics() when the
incoming metrics parameter is nil).
🧹 Nitpick comments (2)
inhibit/inhibit_test.go (1)

45-50: Assert AlertStateActive on the non-muted path.

The false branch currently accepts both “explicitly cleared” and “never written”. If inhibit.Inhibitor.Mutes stops calling SetInhibited(fp, nil), this helper still passes because InhibitedBy stays empty on an unprocessed marker.

✅ Small test hardening
 	if wantMuted {
 		require.Equal(t, alert.AlertStateSuppressed, status.State, msgAndArgs...)
 		require.NotEmpty(t, status.InhibitedBy, msgAndArgs...)
 	} else {
+		require.Equal(t, alert.AlertStateActive, status.State, msgAndArgs...)
 		require.Empty(t, status.InhibitedBy, msgAndArgs...)
 	}
 }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@inhibit/inhibit_test.go` around lines 45 - 50, The non-muted branch of the
test only checks that status.InhibitedBy is empty but doesn't assert the state;
update the else branch after calling m.Status(fp) to also require that
status.State equals alert.AlertStateActive so the test fails if an unprocessed
marker leaves the state unset (e.g., if inhibit.Inhibitor.Mutes stops calling
SetInhibited). Keep the existing InhibitedBy emptiness check and add the
AlertStateActive assertion referencing status and alert.AlertStateActive.
api/v2/api.go (1)

483-507: Short-circuit label filters before recomputing status.

api.setAlertStatus reruns the silencer/inhibitor pipeline for every alert here, even when the label matchers already exclude it. On large /alerts and /alerts/groups scans with selective filters, that does unnecessary work in the hot path.

⚡ Suggested reorder
 	if !a.EndsAt.IsZero() && a.EndsAt.Before(now) {
 		return false
 	}
+	if !alertMatchesFilterLabels(&a.Alert, matchers) {
+		return false
+	}
 
 	// Set alert's current status based on its label set.
 	// The inhibitor and silencer write to m via the context.
 	ctx = marker.WithContext(ctx, m)
 	api.setAlertStatus(ctx, a.Labels)
@@
 	if !inhibited && len(status.InhibitedBy) != 0 {
 		return false
 	}
 
-	return alertMatchesFilterLabels(&a.Alert, matchers)
+	return true
 }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@api/v2/api.go` around lines 483 - 507, The code is calling api.setAlertStatus
(which reruns the silencer/inhibitor pipeline) before applying label filters,
causing unnecessary work; change the order so you call
alertMatchesFilterLabels(&a.Alert, matchers) first and return false when the
alert doesn't match, and only then call marker.WithContext(ctx, m) and
api.setAlertStatus(ctx, a.Labels) and query m.Status(a.Fingerprint()) for the
alerts that passed the label filter; ensure the existing short-circuit checks
that use status (m.Status, status.SilencedBy, status.InhibitedBy, status.State)
remain after setAlertStatus so they only run when needed.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@marker/context.go`:
- Around line 35-39: Update the comment on FromContext to accurately describe
its behavior: state that it returns the AlertMarker and a boolean indicating
presence, and that when absent it returns nil and false (not a no-op marker).
Reference the FromContext function and keyAlertMarker so readers know where to
look and ensure callers check the returned ok before using the marker.

---

Duplicate comments:
In `@dispatch/dispatch.go`:
- Around line 131-133: The code currently only guards
metrics.alertsCollector.dispatcher.Store(disp) but leaves d.metrics potentially
nil elsewhere (d.metrics used at lines referenced), causing panics; fix by
normalizing nil metrics in the dispatcher constructor: implement
NewNoopDispatcherMetrics() that returns a *DispatcherMetrics whose
gauges/counters/histograms and alertsCollector are no-op implementations, and
replace any assignment that may leave d.metrics nil with a call to
NewNoopDispatcherMetrics() so d.metrics is never nil (alternatively ensure the
constructor always sets d.metrics to NewNoopDispatcherMetrics() when the
incoming metrics parameter is nil).

---

Nitpick comments:
In `@api/v2/api.go`:
- Around line 483-507: The code is calling api.setAlertStatus (which reruns the
silencer/inhibitor pipeline) before applying label filters, causing unnecessary
work; change the order so you call alertMatchesFilterLabels(&a.Alert, matchers)
first and return false when the alert doesn't match, and only then call
marker.WithContext(ctx, m) and api.setAlertStatus(ctx, a.Labels) and query
m.Status(a.Fingerprint()) for the alerts that passed the label filter; ensure
the existing short-circuit checks that use status (m.Status, status.SilencedBy,
status.InhibitedBy, status.State) remain after setAlertStatus so they only run
when needed.

In `@inhibit/inhibit_test.go`:
- Around line 45-50: The non-muted branch of the test only checks that
status.InhibitedBy is empty but doesn't assert the state; update the else branch
after calling m.Status(fp) to also require that status.State equals
alert.AlertStateActive so the test fails if an unprocessed marker leaves the
state unset (e.g., if inhibit.Inhibitor.Mutes stops calling SetInhibited). Keep
the existing InhibitedBy emptiness check and add the AlertStateActive assertion
referencing status and alert.AlertStateActive.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 172d86ce-7958-458a-bd24-de7fac631af1

📥 Commits

Reviewing files that changed from the base of the PR and between 35b9fb6 and 231742a.

📒 Files selected for processing (13)
  • alert/state.go
  • api/v2/api.go
  • dispatch/dispatch.go
  • dispatch/metric.go
  • inhibit/inhibit.go
  • inhibit/inhibit_test.go
  • marker/alert.go
  • marker/alert_test.go
  • marker/context.go
  • notify/mute_test.go
  • notify/notify.go
  • silence/silence.go
  • silence/silence_test.go
🚧 Files skipped from review as they are similar to previous changes (4)
  • silence/silence_test.go
  • notify/mute_test.go
  • silence/silence.go
  • notify/notify.go

Comment thread marker/context.go Outdated
@siavashs siavashs force-pushed the feat/group_markers branch from 231742a to 364d6ab Compare April 14, 2026 13:14
@siavashs
Copy link
Copy Markdown
Contributor Author

@coderabbitai review

@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented Apr 14, 2026

✅ Actions performed

Review triggered.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 3

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
api/v2/api.go (1)

434-470: ⚠️ Potential issue | 🟠 Major

Use each group's marker when serializing /alerts/groups.

This handler rebuilds status into one request-scoped tempMarker keyed only by fingerprint, then reads that same value back for every group. If the same alert appears in multiple aggregation groups with different silencing/inhibition state, the last write wins and every group reports the same status. That loses the per-group marker semantics this refactor is introducing.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@api/v2/api.go` around lines 434 - 470, The handler currently uses one
request-scoped tempMarker for all groups so an alert present in multiple groups
can have its status overwritten; instead, create a per-group marker and compute
statuses per group before serializing. After alertGroups is returned, for each
alertGroup allocate groupMarker := marker.NewAlertMarker(), run the
inhibitor/silencer pipeline for that group's alerts (reuse api.alertFilter or
call the underlying setAlertStatus logic with groupMarker) and then call
groupMarker.Status(fp) when constructing apiAlert (replace tempMarker.Status(fp)
with groupMarker.Status(fp)); keep references to api.alertFilter, alertGroups,
tempMarker, and AlertToOpenAPIAlert to locate the spots to change.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@api/v2/api.go`:
- Around line 479-497: The closure returned by alertFilter reads
api.setAlertStatus without synchronization and can race with Update(); before
constructing the closure in alertFilter, take api.mtx (or api.mtx.RLock/RUnlock
as appropriate) and snapshot the current api.setAlertStatus into a local
variable (e.g., setAlertStatus := api.setAlertStatus), then release the lock and
use that local inside the returned func; reference alertFilter and
setAlertStatus so the closure uses the captured function pointer instead of
accessing api.setAlertStatus directly, ensuring safe concurrent reloads invoked
by Update(), getAlertsHandler, and getAlertGroupsHandler.

In `@dispatch/metric.go`:
- Line 47: The parameter registerLimitMetrics on NewDispatcherMetrics is unused;
either remove it from NewDispatcherMetrics' signature and all callers, or
implement conditional registration: if registerLimitMetrics is true then
register the aggrGroupLimitReached metric (and any related limit metrics) with
the provided prometheus.Registerer `r`, otherwise skip registering it; update
any tests/call sites accordingly so they match the new signature or behavior.

In `@inhibit/inhibit_test.go`:
- Around line 37-53: The helper checkMutes currently only asserts InhibitedBy is
empty when wantMuted is false, which can miss cases where the marker was never
written; update checkMutes (which uses marker.NewAlertMarker,
marker.WithContext, ih.Mutes and m.Status(fp)) to also assert that status.State
== alert.AlertStateActive on the non-muted path (and keep the existing checks
for alert.AlertStateSuppressed and non-empty InhibitedBy when wantMuted is true)
so the test verifies SetInhibited was called for non-muted alerts as well.

---

Outside diff comments:
In `@api/v2/api.go`:
- Around line 434-470: The handler currently uses one request-scoped tempMarker
for all groups so an alert present in multiple groups can have its status
overwritten; instead, create a per-group marker and compute statuses per group
before serializing. After alertGroups is returned, for each alertGroup allocate
groupMarker := marker.NewAlertMarker(), run the inhibitor/silencer pipeline for
that group's alerts (reuse api.alertFilter or call the underlying setAlertStatus
logic with groupMarker) and then call groupMarker.Status(fp) when constructing
apiAlert (replace tempMarker.Status(fp) with groupMarker.Status(fp)); keep
references to api.alertFilter, alertGroups, tempMarker, and AlertToOpenAPIAlert
to locate the spots to change.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro Plus

Run ID: 68f7566c-1035-48df-85ac-115fcc4985c7

📥 Commits

Reviewing files that changed from the base of the PR and between 231742a and 512409b.

📒 Files selected for processing (31)
  • alert/state.go
  • api/api.go
  • api/v2/api.go
  • api/v2/api_test.go
  • api/v2/compat.go
  • cmd/alertmanager/main.go
  • dispatch/dispatch.go
  • dispatch/dispatch_bench_test.go
  • dispatch/dispatch_test.go
  • dispatch/metric.go
  • featurecontrol/featurecontrol.go
  • inhibit/inhibit.go
  • inhibit/inhibit_bench_test.go
  • inhibit/inhibit_test.go
  • marker/alert.go
  • marker/alert_test.go
  • marker/context.go
  • marker/group.go
  • marker/group_test.go
  • marker/marker.go
  • marker/status.go
  • notify/mute.go
  • notify/mute_test.go
  • notify/notify.go
  • provider/mem/mem.go
  • provider/mem/mem_test.go
  • silence/silence.go
  • silence/silence_bench_test.go
  • silence/silence_test.go
  • types/types.go
  • types/types_test.go
💤 Files with no reviewable changes (4)
  • api/api.go
  • types/types_test.go
  • types/types.go
  • provider/mem/mem.go
✅ Files skipped from review due to trivial changes (7)
  • marker/status.go
  • silence/silence_bench_test.go
  • api/v2/api_test.go
  • marker/marker.go
  • featurecontrol/featurecontrol.go
  • marker/group_test.go
  • dispatch/dispatch_test.go
🚧 Files skipped from review as they are similar to previous changes (12)
  • alert/state.go
  • marker/alert.go
  • cmd/alertmanager/main.go
  • silence/silence_test.go
  • api/v2/compat.go
  • notify/mute_test.go
  • inhibit/inhibit_bench_test.go
  • marker/context.go
  • dispatch/dispatch_bench_test.go
  • inhibit/inhibit.go
  • silence/silence.go
  • dispatch/dispatch.go

Comment thread api/v2/api.go Outdated
Comment thread dispatch/metric.go Outdated
Comment thread inhibit/inhibit_test.go
@siavashs
Copy link
Copy Markdown
Contributor Author

@coderabbitai review

@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented Apr 16, 2026

✅ Actions performed

Review triggered.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

♻️ Duplicate comments (1)
inhibit/inhibit_test.go (1)

47-52: ⚠️ Potential issue | 🟡 Minor

Keep the empty InhibitedBy assertion on the active path.

Adding the AlertStateActive check is good, but dropping require.Empty(t, status.InhibitedBy, ...) weakens this helper again. The non-muted branch should still prove both that the marker was written and that no inhibition IDs were recorded.

Suggested tweak
 	} else {
 		require.Equal(t, alert.AlertStateActive, status.State, msgAndArgs...)
+		require.Empty(t, status.InhibitedBy, msgAndArgs...)
 	}
 }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@inhibit/inhibit_test.go` around lines 47 - 52, The non-muted branch dropped
the assertion that no inhibition IDs were recorded; when wantMuted is false you
should also assert status.InhibitedBy is empty. Modify the else branch that
checks require.Equal(t, alert.AlertStateActive, status.State, msgAndArgs...) to
additionally call require.Empty(t, status.InhibitedBy, msgAndArgs...) so the
helper verifies both the active state and that no inhibition IDs were written
(referencing wantMuted, status.State, status.InhibitedBy, and AlertStateActive).
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@api/v2/api.go`:
- Around line 434-442: The code creates a single request-scoped tempMarker via
marker.NewAlertMarker() and passes it into api.alertFilter, which lets
setAlertStatus mutate that shared tempMarker causing status races across groups;
instead, change the logic so /alerts/groups does not reuse a single tempMarker
for the entire request: compute or fetch each group's actual marker/status
(e.g., call the group's marker lookup or instantiate a fresh marker per group)
and pass that per-group marker into setAlertStatus/api.alertFilter rather than
the global tempMarker, ensuring you stop using tempMarker.Status(fp) as a
request-wide cache; update the code paths around api.alertFilter,
setAlertStatus, and any callers that currently rely on the shared tempMarker
(also adjust the same pattern at the other noted spot around lines 469-470).

---

Duplicate comments:
In `@inhibit/inhibit_test.go`:
- Around line 47-52: The non-muted branch dropped the assertion that no
inhibition IDs were recorded; when wantMuted is false you should also assert
status.InhibitedBy is empty. Modify the else branch that checks require.Equal(t,
alert.AlertStateActive, status.State, msgAndArgs...) to additionally call
require.Empty(t, status.InhibitedBy, msgAndArgs...) so the helper verifies both
the active state and that no inhibition IDs were written (referencing wantMuted,
status.State, status.InhibitedBy, and AlertStateActive).
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro Plus

Run ID: 7fbf439a-752e-4283-977c-8c9f86a7d849

📥 Commits

Reviewing files that changed from the base of the PR and between 512409b and a28f0b4.

📒 Files selected for processing (2)
  • api/v2/api.go
  • inhibit/inhibit_test.go

Comment thread api/v2/api.go
Comment on lines +434 to +442
// Use a temporary marker to "predict" each alert's current status at
// query time (silenced, inhibited, active). The alertFilter calls
// setAlertStatus which runs the inhibitor/silencer pipeline and writes
// the result into tempMarker. We then use tempMarker for the response
// to preserve backward-compatible behavior that clients depend on.
// TODO: consider a feature flag or API query param to let callers opt
// into using the group's actual marker status instead of the prediction.
tempMarker := marker.NewAlertMarker()
af := api.alertFilter(ctx, matchers, *params.Silenced, *params.Inhibited, *params.Active, tempMarker)
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

Don’t collapse /alerts/groups back onto a shared request marker.

Using one tempMarker for the whole groups request reintroduces the global-marker behavior this PR is trying to remove. The same alert fingerprint can exist in multiple aggregation groups/routes, so later setAlertStatus writes can overwrite earlier state and tempMarker.Status(fp) can report the wrong status for a given group. /alerts/groups needs the group’s actual marker/status, not a request-wide prediction cache.

Also applies to: 469-470

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@api/v2/api.go` around lines 434 - 442, The code creates a single
request-scoped tempMarker via marker.NewAlertMarker() and passes it into
api.alertFilter, which lets setAlertStatus mutate that shared tempMarker causing
status races across groups; instead, change the logic so /alerts/groups does not
reuse a single tempMarker for the entire request: compute or fetch each group's
actual marker/status (e.g., call the group's marker lookup or instantiate a
fresh marker per group) and pass that per-group marker into
setAlertStatus/api.alertFilter rather than the global tempMarker, ensuring you
stop using tempMarker.Status(fp) as a request-wide cache; update the code paths
around api.alertFilter, setAlertStatus, and any callers that currently rely on
the shared tempMarker (also adjust the same pattern at the other noted spot
around lines 469-470).

Comment thread dispatch/metric.go Outdated
Comment thread dispatch/metric.go
c.emit(ch, float64(suppressed), labelValues...)
labelValues[0] = string(alert.AlertStateUnprocessed)
c.emit(ch, float64(unprocessed), labelValues...)
return true
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here do we want to exclude groups that have 0 in each state? or not?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess not, because these are gauge metrics and can drop to zero, so I assume having zero values in between changes is desired compared to missing metrics.
What do you think?

Comment thread api/v2/api.go
// Set alert's current status based on its label set.
api.setAlertStatus(ctx, a.Labels)
// The inhibitor and silencer write to m via the context.
ctx = marker.WithContext(ctx, m)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you want to create this context outside the function itself? This way we would avoid creating a new context each time the returned func is called

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, we need to keep it as is since each call to the returned function is a new operation that we want to trace as a child span.
This specific call returns a derived context which is pointing tot he parent.

Comment thread marker/group.go
status = &groupStatus{}
m.groups[key] = status
}
status.mutedBy = timeIntervalNames
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If timeIntervalNames is nil or empty, shoudl we actually delete from the map, here?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not sure, but this is just moving existing logic and is not new code.

Comment thread silence/silence.go
"pending", len(allIDs)-len(activeIDs),
)
// TODO: remove this sort once the marker is removed.
sort.Strings(activeIDs)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are we sure we shouldn't still sort for the API to be consistent and the results comparable? not necessarily here, but somewhere...

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not sure, but does it matter if the IDs are in a specific order or do we guarantee this?

Comment thread marker/group.go
status = &groupStatus{}
m.groups[key] = status
}
status.mutedBy = timeIntervalNames
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In other places you clone the slice, but not here... Any particular reason?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is just moving existing logic and is not new code.
The PR is not focused on Group Markers used by time intervals so I would leave any changes related to those for later.

siavashs added 12 commits April 20, 2026 20:49
This change adds alert markers to the aggregation groups in dispatcher.
Alert markers replace the global marker and are used to track
the state of alerts in each aggregation group.

This change touches many components of the alertmanager.
Per Group alert markers are passed to the notifiers and then inhibitor
and silencer using context.

The API has no breaking changes:
- /alerts uses a temporary marker to track the state of alerts
- /alerts/groups returns the group markers

Update metrics to use group markers.
The `alertmanager_alerts` metric is moved to dispatcher.
The `alertmanager_marked_alerts` metric is removed.
By default it behaves the same as before, by aggregating
all alerts in the groups.
Enabling `group-key-in-metrics` flag will cause the metrics
to be grouped by `group_key`.

Signed-off-by: Siavash Safi <siavash@cloudflare.com>
Signed-off-by: Siavash Safi <siavash@cloudflare.com>
Signed-off-by: Siavash Safi <siavash@cloudflare.com>
Signed-off-by: Siavash Safi <siavash@cloudflare.com>
Signed-off-by: Siavash Safi <siavash@cloudflare.com>
Signed-off-by: Siavash Safi <siavash@cloudflare.com>
Signed-off-by: Siavash Safi <siavash@cloudflare.com>
Signed-off-by: Siavash Safi <siavash@cloudflare.com>
Signed-off-by: Siavash Safi <siavash@cloudflare.com>
Signed-off-by: Siavash Safi <siavash@cloudflare.com>
Signed-off-by: Siavash Safi <siavash@cloudflare.com>
Signed-off-by: Siavash Safi <siavash@cloudflare.com>
@siavashs siavashs force-pushed the feat/group_markers branch from a28f0b4 to db964e5 Compare April 20, 2026 18:50
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Migrated from Global Marker to per Aggregation Group markers

5 participants