feat(dispatch): add alert markers per group #5047

siavashs wants to merge 12 commits into prometheus:main from
Conversation
Force-pushed cece42d to b5e2104

Force-pushed b5e2104 to b4055cb
Reviewed the new marker package standalone. LGTM

```go
// SetInhibited implements AlertMarker.
func (m *alertMarker) SetInhibited(fp model.Fingerprint, inhibitedBy []string) {
```
Just out of curiosity, the previous SetInhibited used

```go
func (m *MemMarker) SetInhibited(alert model.Fingerprint, ids ...string) {
```

and we changed it to `[]string`. Is there a reason we changed?
Just for consistency in the interface API for both SetInhibited and SetSilenced.
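The consistency point can be sketched as follows. This is a minimal stand-in, not the actual Alertmanager code: `Fingerprint` replaces `model.Fingerprint`, and `memMarker` is a hypothetical implementation; only the symmetric `[]string` signatures mirror the diff above.

```go
package main

import "fmt"

// Fingerprint stands in for model.Fingerprint in this sketch.
type Fingerprint uint64

// AlertMarker keeps SetSilenced and SetInhibited symmetric: both take a
// plain []string instead of one of them being variadic.
type AlertMarker interface {
	SetSilenced(fp Fingerprint, silencedBy []string)
	SetInhibited(fp Fingerprint, inhibitedBy []string)
}

type memMarker struct {
	silenced  map[Fingerprint][]string
	inhibited map[Fingerprint][]string
}

func newMemMarker() *memMarker {
	return &memMarker{
		silenced:  map[Fingerprint][]string{},
		inhibited: map[Fingerprint][]string{},
	}
}

func (m *memMarker) SetSilenced(fp Fingerprint, silencedBy []string) {
	m.silenced[fp] = silencedBy
}

func (m *memMarker) SetInhibited(fp Fingerprint, inhibitedBy []string) {
	m.inhibited[fp] = inhibitedBy
}

func main() {
	var mk AlertMarker = newMemMarker()
	mk.SetSilenced(1, []string{"sil-1", "sil-2"})
	mk.SetInhibited(1, []string{"rule-a"})
	fmt.Println("ok")
}
```

With both methods taking `[]string`, callers that already hold a slice (the common case when copying IDs out of another structure) never need to splat with `ids...`.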
```go
	delete(m.groups, routeID+groupKey)
}

func (m *MemMarker) registerMetrics(r prometheus.Registerer) {
```
We should note in the release notes that we removed this metric, since it would be too expensive to calculate across all markers. Unless, each time we change a marker, we were to add/remove the relevant delta from the metric. But yeah, if it's not worth it or useful, we should definitely say it's gone.
I put it in the PR description:
[CHANGE] Remove `alertmanager_marked_alerts`
Do we need to also update any MD files? cc @SoloJacobs
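The incremental alternative mentioned above could be sketched like this: instead of recomputing a count by scanning all markers, keep per-state counts and adjust them on every transition. This is an illustrative stand-in with plain ints; a real version would presumably back `counts` with a Prometheus `GaugeVec`, and the `state`/`stateCounter` names are invented for the sketch.

```go
package main

import "fmt"

type state string

const (
	active     state = "active"
	suppressed state = "suppressed"
)

// stateCounter tracks how many alerts are in each state, updated
// incrementally on transitions rather than recomputed across all markers.
type stateCounter struct {
	current map[uint64]state // fingerprint -> last known state
	counts  map[state]int
}

func newStateCounter() *stateCounter {
	return &stateCounter{current: map[uint64]state{}, counts: map[state]int{}}
}

// transition moves fp to next, decrementing the previous state's count
// so the totals stay correct without a full rescan.
func (c *stateCounter) transition(fp uint64, next state) {
	if prev, ok := c.current[fp]; ok {
		c.counts[prev]--
	}
	c.current[fp] = next
	c.counts[next]++
}

func main() {
	c := newStateCounter()
	c.transition(1, active)
	c.transition(2, active)
	c.transition(1, suppressed)
	fmt.Println(c.counts[active], c.counts[suppressed]) // 1 1
}
```

The trade-off in the thread is exactly this: the incremental bookkeeping is cheap per transition but adds a code path to every marker mutation, versus simply dropping the metric.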
```go
	} else {
		filtered = append(filtered, a)
	}
	// TODO(fabxc): increment muted alerts counter if muted.
```
Are this TODO and the one above still relevant?
I think he wanted to add/change the metrics here. I am not sure if we need it. I can remove these, but my changes don't touch these lines.
Force-pushed 724c634 to 9a5c33e
@coderabbitai review

✅ Actions performed: Review triggered.
Note: Reviews paused. It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior in the settings. Use the following commands to manage reviews, or use the checkboxes below for quick actions.
📝 Walkthrough

Migrates from a global marker to per-aggregation-group markers, removes callback-based alert-status plumbing (`Options.AlertStatusFunc`), introduces a marker package with context propagation, switches code from

Changes
Sequence Diagram

```mermaid
sequenceDiagram
    participant Client
    participant APIv2
    participant Dispatcher
    participant GroupMarker
    participant AlertMarker
    participant Context
    Client->>APIv2: GET /api/v2/alerts
    APIv2->>Dispatcher: request groups + alertFilter(ctx)
    Dispatcher->>GroupMarker: ensure group-level marker
    Dispatcher->>AlertMarker: create NewAlertMarker() per-aggregation-group
    Dispatcher->>Context: ctx' = marker.WithContext(ctx, alertMarker)
    Dispatcher->>APIv2: return AlertGroup(s) with attached ctx'
    APIv2->>AlertMarker: Status(fp) via marker.FromContext(ctx') for each alert
    AlertMarker-->>APIv2: alert.AlertStatus (state, silenced, inhibited)
    APIv2-->>Client: OpenAPI response enriched with per-alert statuses
```
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~60 minutes

🚥 Pre-merge checks: ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

✅ Passed checks (4 passed)

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches: 🧪 Generate unit tests (beta)
✅ Actions performed: Review triggered.
Actionable comments posted: 3
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
notify/notify.go (1)
Lines 37-43: ⚠️ Potential issue | 🟡 Minor

Remove unused `types` import from `dispatch/dispatch_test.go`. The `types` import in `notify/notify.go` (line 43) is needed and used at line 887 (`[]*types.Alert`). However, the same import in `dispatch/dispatch_test.go` (line 36) is unused and should be removed.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@notify/notify.go` around lines 37 - 43, Remove the unused "types" import from dispatch/dispatch_test.go: open the import block in dispatch/dispatch_test.go and delete the "github.com/prometheus/alertmanager/types" entry (it's not referenced in that test), then run `go vet`/`go test` to ensure no other usages; do not change the existing usage of types in notify/notify.go (e.g., the []*types.Alert at line 887) which must remain.
🧹 Nitpick comments (3)
marker/alert.go (1)
Lines 38-48: Copy the reason slices at the marker boundary.

`SetSilenced`/`SetInhibited` keep the caller's slice as-is, and `Status` returns the same backing arrays back out. That makes the marker's internal state externally mutable if a caller reuses or edits either slice later. Clone on write or before returning so the marker fully owns its state.

Also applies to: 51-61, 79-85
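The clone-on-write/clone-on-read fix can be sketched with `slices.Clone` (Go 1.21+). This is a reduced stand-in, not the actual `marker/alert.go`: the `alertStatus`/`alertMarker` shapes and the `uint64` fingerprint are simplifications for illustration.

```go
package main

import (
	"fmt"
	"slices"
)

type alertStatus struct {
	SilencedBy  []string
	InhibitedBy []string
}

type alertMarker struct {
	status map[uint64]*alertStatus
}

// SetSilenced clones the caller's slice so later edits by the caller
// cannot mutate the marker's internal state.
func (m *alertMarker) SetSilenced(fp uint64, silencedBy []string) {
	m.get(fp).SilencedBy = slices.Clone(silencedBy)
}

// Status returns a value whose slices are cloned, so callers never hold
// the marker's backing arrays.
func (m *alertMarker) Status(fp uint64) alertStatus {
	s := m.get(fp)
	return alertStatus{
		SilencedBy:  slices.Clone(s.SilencedBy),
		InhibitedBy: slices.Clone(s.InhibitedBy),
	}
}

func (m *alertMarker) get(fp uint64) *alertStatus {
	if m.status == nil {
		m.status = map[uint64]*alertStatus{}
	}
	if _, ok := m.status[fp]; !ok {
		m.status[fp] = &alertStatus{}
	}
	return m.status[fp]
}

func main() {
	m := &alertMarker{}
	ids := []string{"sil-1"}
	m.SetSilenced(1, ids)
	ids[0] = "mutated" // caller reuses its slice; marker is unaffected
	fmt.Println(m.Status(1).SilencedBy[0]) // sil-1
}
```

Cloning at both boundaries costs one allocation per call but removes a whole class of aliasing bugs, which is usually the right trade for a shared mutable structure like a marker.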
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@marker/alert.go` around lines 38 - 48, SetSilenced, SetInhibited and Status currently reuse caller slices and return the same backing arrays, exposing internal state; fix by copying slices on write and before returning so the marker owns its data. In alertMarker.SetSilenced and alertMarker.SetInhibited, replace assignments to alertStatus.SilencedBy / InhibitedBy with a copied slice (make + copy) of the incoming slice; in alertMarker.Status, when constructing the returned Alert or alertStatus, clone any slices you return (SilencedBy, InhibitedBy, and Reason slices if present) so callers never get pointers to the marker's internal backing arrays.

dispatch/metric.go (1)
Lines 35-79: `registerLimitMetrics` is ignored here.

`NewDispatcherMetrics` still registers `alertmanager_dispatcher_aggregation_group_limit_reached_total` unconditionally, so callers passing `false` no longer get the behavior the signature advertises. Either wire the flag back into registration or drop the parameter to avoid a misleading API.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@dispatch/metric.go` around lines 35 - 79, The function NewDispatcherMetrics ignores the registerLimitMetrics parameter but still unconditionally creates/registers the aggrGroupLimitReached counter; update NewDispatcherMetrics so that when registerLimitMetrics is false it does not create or register alertmanager_dispatcher_aggregation_group_limit_reached_total (e.g., set aggrGroupLimitReached to nil or a no-op metric) and only call promauto.With(r).NewCounter for aggrGroupLimitReached when registerLimitMetrics is true; alternatively, if the flag is unnecessary, remove the registerLimitMetrics parameter and its callers—make the change in NewDispatcherMetrics and the DispatcherMetrics struct usage (aggrGroupLimitReached) accordingly.

marker/group_test.go (1)
Lines 53-75: Consider adding a concurrent access test.

Given that `GroupMarker` uses `sync.RWMutex` for thread safety, a test exercising concurrent `SetMuted`/`Muted`/`DeleteByGroupKey` calls would help validate the locking behavior under contention.
Verify each finding against the current code and only fix it if needed. In `@marker/group_test.go` around lines 53 - 75, Add a new test that creates a GroupMarker via NewGroupMarker and runs concurrent goroutines calling SetMuted, Muted, and DeleteByGroupKey to exercise the RWMutex under contention; use sync.WaitGroup and multiple iterations with randomized sleeps to interleave operations, assert no panics/races and that final state matches expected (e.g., set a key, concurrently read and delete it, then verify Muted returns false after DeleteByGroupKey). Target the methods NewGroupMarker, SetMuted, Muted, and DeleteByGroupKey in the test to validate thread-safety.
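A self-contained sketch of what such a contention exercise could look like. The `groupMarker` here is a hypothetical stand-in implementing the three method names from the review comment; in the real suite this would target `marker.GroupMarker` and run under `-race`.

```go
package main

import (
	"fmt"
	"sync"
)

// groupMarker stands in for marker.GroupMarker: RWMutex-guarded map of
// routeID+groupKey to active time-interval names.
type groupMarker struct {
	mtx   sync.RWMutex
	muted map[string][]string
}

func newGroupMarker() *groupMarker {
	return &groupMarker{muted: map[string][]string{}}
}

func (m *groupMarker) SetMuted(routeID, groupKey string, timeIntervalNames []string) {
	m.mtx.Lock()
	defer m.mtx.Unlock()
	m.muted[routeID+groupKey] = timeIntervalNames
}

func (m *groupMarker) Muted(routeID, groupKey string) ([]string, bool) {
	m.mtx.RLock()
	defer m.mtx.RUnlock()
	names, ok := m.muted[routeID+groupKey]
	return names, ok
}

func (m *groupMarker) DeleteByGroupKey(routeID, groupKey string) {
	m.mtx.Lock()
	defer m.mtx.Unlock()
	delete(m.muted, routeID+groupKey)
}

func main() {
	m := newGroupMarker()
	var wg sync.WaitGroup
	// Interleave writers, readers, and deleters on the same key.
	for i := 0; i < 50; i++ {
		wg.Add(3)
		go func() { defer wg.Done(); m.SetMuted("r1", "g1", []string{"weekends"}) }()
		go func() { defer wg.Done(); m.Muted("r1", "g1") }()
		go func() { defer wg.Done(); m.DeleteByGroupKey("r1", "g1") }()
	}
	wg.Wait()
	// Deterministic final invariant after the storm.
	m.DeleteByGroupKey("r1", "g1")
	if _, ok := m.Muted("r1", "g1"); ok {
		panic("expected no mute after delete")
	}
	fmt.Println("ok")
}
```

Because the interleaving itself is nondeterministic, the test only asserts a deterministic final state plus the absence of panics; the race detector does the rest.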
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@dispatch/dispatch_test.go`:
- Line 37: The build is failing because the import
"github.com/prometheus/alertmanager/types" is unused in dispatch_test.go; remove
that import from the file's import block (or replace it with a blank identifier
only if intentionally needed) so the test compiles—locate the import declaration
referencing types in dispatch_test.go and delete it or remove any references to
types.* so the compiler no longer reports an unused import.
In `@dispatch/metric.go`:
- Around line 124-149: The seen map population in the loop over
d.routeGroupsSlice (aggrGroup -> ag.alerts.List()) uses first-write-wins; change
it so on each fingerprint fp encountered you compute the incoming state via
ag.marker.Status(fp).State and merge it with any existing seen[fp] using
explicit precedence suppressed > active > unprocessed (i.e., if existing is
lower precedence than incoming, replace it; otherwise keep existing). Keep the
rest of the counting logic over seen unchanged.
In `@notify/notify.go`:
- Line 887: The method RetryStage.exec currently accepts variadic alerts of type
*alert.Alert but declares it returns []*types.Alert and then returns the
incoming alerts directly, causing a type mismatch; fix by converting the
incoming []*alert.Alert to []*types.Alert before returning (or change the
parameter type to []*types.Alert if callers can be updated). Locate
RetryStage.exec and either (A) map each *alert.Alert to a new *types.Alert
(create a small converter inside exec or call an existing conversion function)
and return that slice, or (B) change the exec signature to accept alerts
...*types.Alert and update call sites accordingly so the returned value matches
the declared []*types.Alert.
📒 Files selected for processing (30)
`api/api.go`, `api/v2/api.go`, `api/v2/api_test.go`, `api/v2/compat.go`, `cmd/alertmanager/main.go`, `dispatch/dispatch.go`, `dispatch/dispatch_bench_test.go`, `dispatch/dispatch_test.go`, `dispatch/metric.go`, `featurecontrol/featurecontrol.go`, `inhibit/inhibit.go`, `inhibit/inhibit_bench_test.go`, `inhibit/inhibit_test.go`, `marker/alert.go`, `marker/alert_test.go`, `marker/group.go`, `marker/group_test.go`, `marker/marker.go`, `marker/status.go`, `marker/util.go`, `notify/mute.go`, `notify/mute_test.go`, `notify/notify.go`, `provider/mem/mem.go`, `provider/mem/mem_test.go`, `silence/silence.go`, `silence/silence_bench_test.go`, `silence/silence_test.go`, `types/types.go`, `types/types_test.go`
💤 Files with no reviewable changes (4)
- api/api.go
- types/types_test.go
- provider/mem/mem.go
- types/types.go
✅ Files skipped from review due to trivial changes (3)
- marker/status.go
- featurecontrol/featurecontrol.go
- api/v2/api_test.go
🚧 Files skipped from review as they are similar to previous changes (4)
- inhibit/inhibit_bench_test.go
- notify/mute_test.go
- silence/silence_test.go
- notify/mute.go
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
inhibit/inhibit_test.go (1)
Lines 312-356: ⚠️ Potential issue | 🟡 Minor

Test case comments are misleading and don't match expected values. Several comments in this test don't align with the expected results. This appears to be copy-paste from `TestInhibitRuleMatches` without updating for the different matcher types (rule1 uses `MatchNotEqual` for `t1`, not `MatchEqual`). Examples:

- Line 313: Comment says "inhibited" but `expected: false`
- Line 335: Comment says "not inhibited" but `expected: true`
- Line 349: Comment says "not inhibited" but `expected: true`

The expected values themselves appear correct for the `MatchNotEqual` semantics, but the comments should be updated to avoid confusion for future maintainers.

📝 Suggested comment fixes

```diff
 {
-	// Matches target filter of rule1, inhibited.
+	// Does NOT match target filter of rule1 (MatchNotEqual t1!=1), not inhibited.
 	target:   model.LabelSet{"t1": "1", "e": "1"},
 	expected: false,
 },
 {
-	// Doesn't match target filter, not inhibited.
+	// Matches target filter of rule1 (t1!=1, so t1=0 matches), inhibited.
 	target:   model.LabelSet{"t1": "0", "e": "1"},
 	expected: true,
 },
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@inhibit/inhibit_test.go` around lines 312 - 356, Update the misleading inline comments in the inhibit test cases to reflect the actual semantics of rule1 (which uses MatchNotEqual for label "t1") rather than MatchEqual; specifically, review each test case referencing rule1 and rule2 (symbols: rule1, rule2, MatchNotEqual) and change the comment text to match the existing expected boolean values (keep expected values as-is), e.g. where a comment currently says "inhibited" but expected:false change it to "not inhibited (because MatchNotEqual on t1)" and where it says "not inhibited" but expected:true change to "inhibited (because MatchNotEqual on t1 does not block)"; ensure all comments describing source/target matching (mentions of sourceAlert1, sourceAlert2, target filters) are consistent with MatchNotEqual semantics.
♻️ Duplicate comments (2)
dispatch/metric.go (1)
Lines 124-153: ⚠️ Potential issue | 🟠 Major

Merge duplicate fingerprint states with explicit precedence. With per-group markers, the same fingerprint can be `active` in one aggregation group and `suppressed` in another. The current first-write-wins logic in `seen` (lines 132-134) makes the metric non-deterministic depending on iteration order. Consider merging with explicit precedence (e.g., suppressed > active > unprocessed).

🛠️ Proposed fix with merge precedence

```diff
 // Deduplicate by fingerprint for backward compatibility.
 seen := map[model.Fingerprint]alert.AlertState{}
 for i := range d.routeGroupsSlice {
 	d.routeGroupsSlice[i].groups.Range(func(_, el any) bool {
 		ag := el.(*aggrGroup)
 		for _, a := range ag.alerts.List() {
 			fp := a.Fingerprint()
 			if !a.Resolved() {
-				if _, ok := seen[fp]; !ok {
-					seen[fp] = ag.marker.Status(fp).State
-				}
+				incoming := ag.marker.Status(fp).State
+				existing, ok := seen[fp]
+				if !ok || statePrecedence(incoming) > statePrecedence(existing) {
+					seen[fp] = incoming
+				}
 			}
 		}
 		return true
 	})
 }

+// statePrecedence returns the precedence of the state for merging.
+// Higher precedence wins when the same fingerprint appears in multiple groups.
+func statePrecedence(s alert.AlertState) int {
+	switch s {
+	case alert.AlertStateSuppressed:
+		return 2
+	case alert.AlertStateActive:
+		return 1
+	default: // unprocessed
+		return 0
+	}
+}
```
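The merge idea above can be reduced to a runnable sketch. The `alertState` type and the `mergeStates` helper here are stand-ins for `alert.AlertState` and the dispatcher loop; only the precedence ordering mirrors the suggested fix.

```go
package main

import "fmt"

type alertState string

const (
	unprocessed alertState = "unprocessed"
	active      alertState = "active"
	suppressed  alertState = "suppressed"
)

// statePrecedence: suppressed > active > unprocessed, so the merged
// per-fingerprint state no longer depends on map iteration order.
func statePrecedence(s alertState) int {
	switch s {
	case suppressed:
		return 2
	case active:
		return 1
	default:
		return 0
	}
}

// mergeStates deduplicates by fingerprint, keeping the highest-precedence
// state observed across all aggregation groups.
func mergeStates(observations map[uint64][]alertState) map[uint64]alertState {
	seen := map[uint64]alertState{}
	for fp, states := range observations {
		for _, s := range states {
			if cur, ok := seen[fp]; !ok || statePrecedence(s) > statePrecedence(cur) {
				seen[fp] = s
			}
		}
	}
	return seen
}

func main() {
	obs := map[uint64][]alertState{
		1: {active, suppressed}, // same alert in two groups
		2: {unprocessed},
	}
	merged := mergeStates(obs)
	fmt.Println(merged[1], merged[2]) // suppressed unprocessed
}
```

With the explicit ordering, reversing or shuffling the group iteration produces the same merged map, which is the determinism the review is asking for.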
Verify each finding against the current code and only fix it if needed. In `@dispatch/metric.go` around lines 124 - 153, The deduplication currently uses first-write-wins in the seen map (seen := map[model.Fingerprint]alert.AlertState{}) when iterating d.routeGroupsSlice and aggrGroup alerts, causing non-deterministic metric values; change the logic inside the loop that does fp := a.Fingerprint() and sets seen[fp] = ag.marker.Status(fp).State to instead merge states by explicit precedence: treat AlertStateSuppressed highest, AlertStateActive next, AlertStateUnprocessed lowest, and when a fingerprint already exists in seen compare the existing state and the new ag.marker.Status(fp).State and replace it only if the new state has higher precedence so the final seen map deterministically reflects suppressed > active > unprocessed before the counting/emission.

notify/notify.go (1)
Line 887: ⚠️ Potential issue | 🔴 Critical

Return type mismatch will cause a compile error. The `exec` method declares return type `[]*types.Alert` but receives `alerts ...*alert.Alert` and returns `alerts` directly. This type mismatch won't compile.

🐛 Proposed fix

```diff
-func (r RetryStage) exec(ctx context.Context, l *slog.Logger, alerts ...*alert.Alert) (context.Context, []*types.Alert, error) {
+func (r RetryStage) exec(ctx context.Context, l *slog.Logger, alerts ...*alert.Alert) (context.Context, []*alert.Alert, error) {
```
Verify each finding against the current code and only fix it if needed. In `@notify/notify.go` at line 887, The exec method signature on RetryStage uses parameter type alerts ...*alert.Alert but declares it returns []*types.Alert, causing a type mismatch; fix by changing the parameter type to ...*types.Alert in the RetryStage.exec signature (and update all callers to pass types.Alert pointers), or alternatively, keep alerts as ...*alert.Alert and convert/marshal each alert.Alert into a *types.Alert slice before returning; refer to the function name RetryStage.exec and the symbols alerts, alert.Alert and types.Alert when making the change.
🧹 Nitpick comments (1)
marker/group_test.go (1)
Lines 22-75: LGTM!

The tests provide good coverage of the `GroupMarker` functionality, including isolation between different route/group combinations and clearing behavior. Consider adding a concurrent access test in the future to validate thread safety under load, though this is not critical for the initial implementation.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@marker/group_test.go` around lines 22 - 75, Add a concurrent-access unit test for GroupMarker to validate thread-safety: create a test (e.g., TestGroupMarker_ConcurrentAccess) that exercises GroupMarker.SetMuted, GroupMarker.Muted and GroupMarker.DeleteByGroupKey from many goroutines using sync.WaitGroup and randomized or patterned operations, wait for completion, and assert final invariants (for example that Muted returns expected timeIntervalNames or empty after deletes); ensure the test covers mixed reads/writes across multiple route/group combinations (route1/route2, group1/group2) to catch race conditions and run with -race in CI.
📒 Files selected for processing (30)
`api/api.go`, `api/v2/api.go`, `api/v2/api_test.go`, `api/v2/compat.go`, `cmd/alertmanager/main.go`, `dispatch/dispatch.go`, `dispatch/dispatch_bench_test.go`, `dispatch/dispatch_test.go`, `dispatch/metric.go`, `featurecontrol/featurecontrol.go`, `inhibit/inhibit.go`, `inhibit/inhibit_bench_test.go`, `inhibit/inhibit_test.go`, `marker/alert.go`, `marker/alert_test.go`, `marker/group.go`, `marker/group_test.go`, `marker/marker.go`, `marker/status.go`, `marker/util.go`, `notify/mute.go`, `notify/mute_test.go`, `notify/notify.go`, `provider/mem/mem.go`, `provider/mem/mem_test.go`, `silence/silence.go`, `silence/silence_bench_test.go`, `silence/silence_test.go`, `types/types.go`, `types/types_test.go`
💤 Files with no reviewable changes (4)
- types/types_test.go
- types/types.go
- provider/mem/mem.go
- api/api.go
✅ Files skipped from review due to trivial changes (9)
- marker/status.go
- silence/silence_bench_test.go
- marker/alert.go
- inhibit/inhibit_bench_test.go
- marker/alert_test.go
- dispatch/dispatch_test.go
- marker/marker.go
- featurecontrol/featurecontrol.go
- silence/silence.go
🚧 Files skipped from review as they are similar to previous changes (6)
- cmd/alertmanager/main.go
- api/v2/api_test.go
- marker/util.go
- silence/silence_test.go
- notify/mute_test.go
- notify/mute.go
Actionable comments posted: 1
♻️ Duplicate comments (1)
dispatch/metric.go (1)
Lines 124-149: ⚠️ Potential issue | 🟠 Major

Deduped alert-state metrics are still order-dependent. With per-group markers, the same fingerprint can legitimately have different states in different aggregation groups. First-write-wins on `seen` makes the default `alertmanager_alerts` result depend on `sync.Map.Range` order instead of a defined precedence.
Verify each finding against the current code and only fix it if needed. In `@dispatch/metric.go` around lines 124 - 149, The dedupe uses first-write-wins on seen which makes results order-dependent; change the assignment logic inside the loop that populates seen (where seen is map[model.Fingerprint]alert.AlertState and the code calls ag.marker.Status(fp).State) to apply a deterministic precedence instead of skipping when a key exists: compute the newState := ag.marker.Status(fp).State and update seen[fp] only if newState has higher precedence than the current seen[fp] (e.g., prefer alert.AlertStateActive over alert.AlertStateSuppressed over others), or implement a small helper priority function to compare states before assigning.
🧹 Nitpick comments (3)
cmd/alertmanager/main.go (1)
Line 291: Minor: Variable shadows package name.

The local variable `marker` shadows the imported `marker` package. While this works correctly (the package is only used for the `NewGroupMarker()` call at initialization), consider renaming to `groupMarker` for clarity.

♻️ Suggested rename for clarity

```diff
-	marker := marker.NewGroupMarker()
+	groupMarker := marker.NewGroupMarker()
```

Then update references at lines 381, 492, 510 accordingly.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@cmd/alertmanager/main.go` at line 291, The local variable named marker shadows the imported marker package; rename the local variable returned by marker.NewGroupMarker() to groupMarker (e.g., groupMarker := marker.NewGroupMarker()) and update all intra-file references that use that local variable (formerly marker) to groupMarker so the package identifier marker remains unshadowed; ensure you update every use site that currently refers to the local variable (the subsequent references in this file).

dispatch/dispatch_test.go (1)
Lines 433-438: Assert the `AlertStatuses` contents, not just that the map exists.

This only verifies allocation. A regression that populates every entry with the wrong state would still pass, even though `Groups()` now exposes these per-alert statuses.

✅ Small test hardening

```diff
 for _, ag := range alertGroups {
 	require.NotNil(t, ag.AlertStatuses)
+	for _, a := range ag.Alerts {
+		status, ok := ag.AlertStatuses[a.Fingerprint()]
+		require.True(t, ok)
+		require.Equal(t, alert.AlertStateActive, status.State)
+	}
 	ag.AlertStatuses = nil
 }
```
Verify each finding against the current code and only fix it if needed. In `@dispatch/dispatch_test.go` around lines 433 - 438, Test currently only checks AlertStatuses is non-nil then clears it; instead assert the actual contents before nil-ing so regressions in per-alert state are caught. In the loop over alertGroups (variable alertGroups, field AlertStatuses) replace the single require.NotNil with a check that AlertStatuses equals the expected map for that group (use require.Equal or similar against your expected per-alert status map for this test), then set ag.AlertStatuses = nil so the later struct comparison still works; ensure you reference the same keys/values that Groups() should expose.

dispatch/metric.go (1)
Lines 36-79: `registerLimitMetrics` is currently ignored.

`aggrGroupLimitReached` is always registered with `promauto.With(r).NewCounter`, so callers passing `false` still expose `alertmanager_dispatcher_aggregation_group_limit_reached_total`. Either drop the flag or make the false branch create an unregistered counter.

🛠️ One local way to honor the flag

```diff
-	aggrGroupLimitReached: promauto.With(r).NewCounter(
-		prometheus.CounterOpts{
-			Name: "alertmanager_dispatcher_aggregation_group_limit_reached_total",
-			Help: "Number of times when dispatcher failed to create new aggregation group due to limit.",
-		},
-	),
+	aggrGroupLimitReached: func() prometheus.Counter {
+		opts := prometheus.CounterOpts{
+			Name: "alertmanager_dispatcher_aggregation_group_limit_reached_total",
+			Help: "Number of times when dispatcher failed to create new aggregation group due to limit.",
+		}
+		if !registerLimitMetrics || r == nil {
+			return prometheus.NewCounter(opts)
+		}
+		return promauto.With(r).NewCounter(opts)
+	}(),
```
Verify each finding against the current code and only fix it if needed. In `@dispatch/metric.go` around lines 36 - 79, The registerLimitMetrics boolean is ignored in NewDispatcherMetrics causing aggrGroupLimitReached to always be registered; update NewDispatcherMetrics so that when registerLimitMetrics is true you create and register the counter with promauto.With(r).NewCounter (as now), and when false you create an unregistered counter using prometheus.NewCounter(prometheus.CounterOpts{...}) (or a no-op implementation) and do not register it with r; refer to the NewDispatcherMetrics function and the aggrGroupLimitReached symbol to implement the conditional creation/registration so callers passing false no longer expose alertmanager_dispatcher_aggregation_group_limit_reached_total.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@dispatch/dispatch.go`:
- Around line 131-133: Normalize nil metrics when constructing the dispatcher:
if the incoming metrics or metrics.alertsCollector is nil, replace them with a
no-op implementation so subsequent unguarded calls on d.metrics (e.g., in
methods referencing d.metrics at the sites where you currently call
d.metrics.Alert.../d.metrics.*) are safe. Implement a simple no-op type that
satisfies the DispatcherMetrics interface (or the concrete shape returned by
NewDispatcherMetrics) with an alertsCollector containing a dispatcher.Store
method that is a no-op, then in the dispatcher constructor (where you currently
call metrics.alertsCollector.dispatcher.Store(disp)) set metrics =
NewNoopDispatcherMetrics() when metrics == nil or metrics.alertsCollector == nil
before storing the dispatcher.
---
Duplicate comments:
In `@dispatch/metric.go`:
- Around line 124-149: The dedupe uses first-write-wins on seen which makes
results order-dependent; change the assignment logic inside the loop that
populates seen (where seen is map[model.Fingerprint]alert.AlertState and the
code calls ag.marker.Status(fp).State) to apply a deterministic precedence
instead of skipping when a key exists: compute the newState :=
ag.marker.Status(fp).State and update seen[fp] only if newState has higher
precedence than the current seen[fp] (e.g., prefer alert.AlertStateActive over
alert.AlertStateSuppressed over others), or implement a small helper priority
function to compare states before assigning.
---
Nitpick comments:
In `@cmd/alertmanager/main.go`:
- Line 291: The local variable named marker shadows the imported marker package;
rename the local variable returned by marker.NewGroupMarker() to groupMarker
(e.g., groupMarker := marker.NewGroupMarker()) and update all intra-file
references that use that local variable (formerly marker) to groupMarker so the
package identifier marker remains unshadowed—ensure you update every use site
that currently refers to the local variable (the subsequent references in this
file).
In `@dispatch/dispatch_test.go`:
- Around line 433-438: Test currently only checks AlertStatuses is non-nil then
clears it; instead assert the actual contents before nil-ing so regressions in
per-alert state are caught. In the loop over alertGroups (variable alertGroups,
field AlertStatuses) replace the single require.NotNil with a check that
AlertStatuses equals the expected map for that group (use require.Equal or
similar against your expected per-alert status map for this test), then set
ag.AlertStatuses = nil so the later struct comparison still works; ensure you
reference the same keys/values that Groups() should expose.
In `@dispatch/metric.go`:
- Around line 36-79: The registerLimitMetrics boolean is ignored in
NewDispatcherMetrics causing aggrGroupLimitReached to always be registered;
update NewDispatcherMetrics so that when registerLimitMetrics is true you create
and register the counter with promauto.With(r).NewCounter (as now), and when
false you create an unregistered counter using
prometheus.NewCounter(prometheus.CounterOpts{...}) (or a no-op implementation)
and do not register it with r; refer to the NewDispatcherMetrics function and
the aggrGroupLimitReached symbol to implement the conditional
creation/registration so callers passing false no longer expose
alertmanager_dispatcher_aggregation_group_limit_reached_total.
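The register-or-not pattern suggested above can be shown with stand-in types. The `counter`, `registry`, and `newLimitCounter` names below are hypothetical sketches of the control flow, not client_golang's real `prometheus.Counter`/`Registerer`:

```go
package main

import "fmt"

// counter is a minimal stand-in for prometheus.Counter.
type counter interface{ Inc() }

type memCounter struct{ n int }

func (c *memCounter) Inc() { c.n++ }

// registry is a minimal stand-in for prometheus.Registerer.
type registry struct{ registered []counter }

func (r *registry) MustRegister(c counter) { r.registered = append(r.registered, c) }

// newLimitCounter mirrors the suggestion: always return a usable counter,
// but only register (expose) it when registerLimitMetrics is true.
func newLimitCounter(r *registry, registerLimitMetrics bool) counter {
	c := &memCounter{}
	if registerLimitMetrics {
		r.MustRegister(c)
	}
	return c
}

func main() {
	r := &registry{}
	exposed := newLimitCounter(r, true)
	hidden := newLimitCounter(r, false)
	exposed.Inc()
	hidden.Inc() // still safe to use, just never scraped
	fmt.Println(len(r.registered))
}
```

Callers passing `false` keep a working counter for internal bookkeeping, but the metric never appears in the registry's exposition.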
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 0d731d28-13a3-4d8a-929c-6c07a9cf7c9e
📒 Files selected for processing (30)
- api/api.go
- api/v2/api.go
- api/v2/api_test.go
- api/v2/compat.go
- cmd/alertmanager/main.go
- dispatch/dispatch.go
- dispatch/dispatch_bench_test.go
- dispatch/dispatch_test.go
- dispatch/metric.go
- featurecontrol/featurecontrol.go
- inhibit/inhibit.go
- inhibit/inhibit_bench_test.go
- inhibit/inhibit_test.go
- marker/alert.go
- marker/alert_test.go
- marker/group.go
- marker/group_test.go
- marker/marker.go
- marker/status.go
- marker/util.go
- notify/mute.go
- notify/mute_test.go
- notify/notify.go
- provider/mem/mem.go
- provider/mem/mem_test.go
- silence/silence.go
- silence/silence_bench_test.go
- silence/silence_test.go
- types/types.go
- types/types_test.go
💤 Files with no reviewable changes (4)
- api/api.go
- types/types_test.go
- provider/mem/mem.go
- types/types.go
✅ Files skipped from review due to trivial changes (5)
- marker/status.go
- marker/group_test.go
- marker/marker.go
- notify/notify.go
- marker/group.go
🚧 Files skipped from review as they are similar to previous changes (8)
- api/v2/api_test.go
- marker/alert.go
- inhibit/inhibit_bench_test.go
- silence/silence_test.go
- featurecontrol/featurecontrol.go
- provider/mem/mem_test.go
- dispatch/dispatch_bench_test.go
- silence/silence.go
Spaceman1701
left a comment
Sorry for the long delay before a review. Overall this looks good to me, just a few minor comments.
Nothing major enough to block this change if someone else approves before I get a chance to re-review!
@coderabbitai review

✅ Actions performed: Review triggered.
Actionable comments posted: 1
♻️ Duplicate comments (1)
dispatch/dispatch.go (1)
131-133: ⚠️ Potential issue | 🔴 Critical — Unguarded `d.metrics` access will panic when metrics are disabled.

The nil check here only guards the `alertsCollector.dispatcher.Store` call. However, `d.metrics` is used without nil checks at lines 157, 264, 278, 451, and 485. When `metrics` is nil, those calls will cause a nil pointer dereference. Either guard all metric call sites or normalize nil `metrics` to a no-op implementation in the constructor.

🛡️ Proposed fix: normalize nil metrics to no-op

```diff
 disp.state.Store(DispatcherStateUnknown)
 disp.loaded = make(chan struct{})
 disp.ctx, disp.cancel = context.WithCancel(context.Background())
+if metrics == nil {
+	metrics = NewNoopDispatcherMetrics()
+}
 if metrics != nil && metrics.alertsCollector != nil {
 	metrics.alertsCollector.dispatcher.Store(disp)
 }
```

This requires implementing a `NewNoopDispatcherMetrics()` function that returns a `*DispatcherMetrics` with no-op gauge/counter/histogram implementations.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@dispatch/dispatch.go` around lines 131 - 133, The code currently only guards metrics.alertsCollector.dispatcher.Store(disp) but leaves d.metrics potentially nil elsewhere (d.metrics used at lines referenced), causing panics; fix by normalizing nil metrics in the dispatcher constructor: implement NewNoopDispatcherMetrics() that returns a *DispatcherMetrics whose gauges/counters/histograms and alertsCollector are no-op implementations, and replace any assignment that may leave d.metrics nil with a call to NewNoopDispatcherMetrics() so d.metrics is never nil (alternatively ensure the constructor always sets d.metrics to NewNoopDispatcherMetrics() when the incoming metrics parameter is nil).
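A minimal sketch of the nil-normalization being proposed. The `dispatcherMetrics`, `newDispatcher`, and `aggrGroups` names below are illustrative stand-ins for the real dispatcher structs, not the actual Alertmanager code:

```go
package main

import "fmt"

// gauge stands in for the metric types a DispatcherMetrics would hold.
type gauge interface{ Set(float64) }

// noopGauge discards every write; it exists so callers never need nil checks.
type noopGauge struct{}

func (noopGauge) Set(float64) {}

// dispatcherMetrics sketches the shape under discussion (fields are illustrative).
type dispatcherMetrics struct {
	aggrGroups gauge
}

// newNoopDispatcherMetrics returns metrics whose sinks are all no-ops.
func newNoopDispatcherMetrics() *dispatcherMetrics {
	return &dispatcherMetrics{aggrGroups: noopGauge{}}
}

type dispatcher struct{ metrics *dispatcherMetrics }

// newDispatcher normalizes nil metrics in the constructor, so later
// unguarded uses of d.metrics can never dereference nil.
func newDispatcher(m *dispatcherMetrics) *dispatcher {
	if m == nil {
		m = newNoopDispatcherMetrics()
	}
	return &dispatcher{metrics: m}
}

func main() {
	d := newDispatcher(nil)
	d.metrics.aggrGroups.Set(3) // safe: writes go to the no-op sink
	fmt.Println(d.metrics != nil)
}
```

The same pattern works for counters and histograms: construct the metric unconditionally, and make registration the only conditional step.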
🧹 Nitpick comments (2)
inhibit/inhibit_test.go (1)
45-50: Assert `AlertStateActive` on the non-muted path.

The false branch currently accepts both "explicitly cleared" and "never written". If `inhibit.Inhibitor.Mutes` stops calling `SetInhibited(fp, nil)`, this helper still passes because `InhibitedBy` stays empty on an unprocessed marker.

✅ Small test hardening

```diff
 if wantMuted {
 	require.Equal(t, alert.AlertStateSuppressed, status.State, msgAndArgs...)
 	require.NotEmpty(t, status.InhibitedBy, msgAndArgs...)
 } else {
+	require.Equal(t, alert.AlertStateActive, status.State, msgAndArgs...)
 	require.Empty(t, status.InhibitedBy, msgAndArgs...)
 }
 }
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@inhibit/inhibit_test.go` around lines 45 - 50, The non-muted branch of the test only checks that status.InhibitedBy is empty but doesn't assert the state; update the else branch after calling m.Status(fp) to also require that status.State equals alert.AlertStateActive so the test fails if an unprocessed marker leaves the state unset (e.g., if inhibit.Inhibitor.Mutes stops calling SetInhibited). Keep the existing InhibitedBy emptiness check and add the AlertStateActive assertion referencing status and alert.AlertStateActive.api/v2/api.go (1)
483-507: Short-circuit label filters before recomputing status.

`api.setAlertStatus` reruns the silencer/inhibitor pipeline for every alert here, even when the label matchers already exclude it. On large `/alerts` and `/alerts/groups` scans with selective filters, that does unnecessary work in the hot path.

⚡ Suggested reorder

```diff
 if !a.EndsAt.IsZero() && a.EndsAt.Before(now) {
 	return false
 }
+if !alertMatchesFilterLabels(&a.Alert, matchers) {
+	return false
+}
 // Set alert's current status based on its label set.
 // The inhibitor and silencer write to m via the context.
 ctx = marker.WithContext(ctx, m)
 api.setAlertStatus(ctx, a.Labels)
@@
 if !inhibited && len(status.InhibitedBy) != 0 {
 	return false
 }
-return alertMatchesFilterLabels(&a.Alert, matchers)
+return true
 }
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@api/v2/api.go` around lines 483 - 507, The code is calling api.setAlertStatus (which reruns the silencer/inhibitor pipeline) before applying label filters, causing unnecessary work; change the order so you call alertMatchesFilterLabels(&a.Alert, matchers) first and return false when the alert doesn't match, and only then call marker.WithContext(ctx, m) and api.setAlertStatus(ctx, a.Labels) and query m.Status(a.Fingerprint()) for the alerts that passed the label filter; ensure the existing short-circuit checks that use status (m.Status, status.SilencedBy, status.InhibitedBy, status.State) remain after setAlertStatus so they only run when needed.
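The reorder above amounts to running the cheap check first so the expensive pipeline only runs for alerts that can appear in the response. A toy sketch, where `matchesFilter` and `computeStatus` are stand-ins for the real label matcher and the silencer/inhibitor pipeline:

```go
package main

import "fmt"

type alert struct {
	labels map[string]string
}

// matchesFilter is the cheap check: plain label-equality matchers.
func matchesFilter(a alert, matchers map[string]string) bool {
	for k, v := range matchers {
		if a.labels[k] != v {
			return false
		}
	}
	return true
}

// statusRuns counts pipeline invocations so the saved work is visible.
var statusRuns int

// computeStatus stands in for the expensive silencer/inhibitor pipeline.
func computeStatus(a alert) string {
	statusRuns++
	return "active"
}

// filterAlert applies the cheap matcher first; only matching alerts pay
// for the status computation.
func filterAlert(a alert, matchers map[string]string) bool {
	if !matchesFilter(a, matchers) {
		return false
	}
	return computeStatus(a) == "active"
}

func main() {
	matchers := map[string]string{"team": "db"}
	alerts := []alert{
		{labels: map[string]string{"team": "db"}},
		{labels: map[string]string{"team": "web"}},
		{labels: map[string]string{"team": "web"}},
	}
	kept := 0
	for _, a := range alerts {
		if filterAlert(a, matchers) {
			kept++
		}
	}
	fmt.Println(kept, statusRuns) // pipeline ran once, not three times
}
```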
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@marker/context.go`:
- Around line 35-39: Update the comment on FromContext to accurately describe
its behavior: state that it returns the AlertMarker and a boolean indicating
presence, and that when absent it returns nil and false (not a no-op marker).
Reference the FromContext function and keyAlertMarker so readers know where to
look and ensure callers check the returned ok before using the marker.
---
Duplicate comments:
In `@dispatch/dispatch.go`:
- Around line 131-133: The code currently only guards
metrics.alertsCollector.dispatcher.Store(disp) but leaves d.metrics potentially
nil elsewhere (d.metrics used at lines referenced), causing panics; fix by
normalizing nil metrics in the dispatcher constructor: implement
NewNoopDispatcherMetrics() that returns a *DispatcherMetrics whose
gauges/counters/histograms and alertsCollector are no-op implementations, and
replace any assignment that may leave d.metrics nil with a call to
NewNoopDispatcherMetrics() so d.metrics is never nil (alternatively ensure the
constructor always sets d.metrics to NewNoopDispatcherMetrics() when the
incoming metrics parameter is nil).
---
Nitpick comments:
In `@api/v2/api.go`:
- Around line 483-507: The code is calling api.setAlertStatus (which reruns the
silencer/inhibitor pipeline) before applying label filters, causing unnecessary
work; change the order so you call alertMatchesFilterLabels(&a.Alert, matchers)
first and return false when the alert doesn't match, and only then call
marker.WithContext(ctx, m) and api.setAlertStatus(ctx, a.Labels) and query
m.Status(a.Fingerprint()) for the alerts that passed the label filter; ensure
the existing short-circuit checks that use status (m.Status, status.SilencedBy,
status.InhibitedBy, status.State) remain after setAlertStatus so they only run
when needed.
In `@inhibit/inhibit_test.go`:
- Around line 45-50: The non-muted branch of the test only checks that
status.InhibitedBy is empty but doesn't assert the state; update the else branch
after calling m.Status(fp) to also require that status.State equals
alert.AlertStateActive so the test fails if an unprocessed marker leaves the
state unset (e.g., if inhibit.Inhibitor.Mutes stops calling SetInhibited). Keep
the existing InhibitedBy emptiness check and add the AlertStateActive assertion
referencing status and alert.AlertStateActive.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 172d86ce-7958-458a-bd24-de7fac631af1
📒 Files selected for processing (13)
- alert/state.go
- api/v2/api.go
- dispatch/dispatch.go
- dispatch/metric.go
- inhibit/inhibit.go
- inhibit/inhibit_test.go
- marker/alert.go
- marker/alert_test.go
- marker/context.go
- notify/mute_test.go
- notify/notify.go
- silence/silence.go
- silence/silence_test.go
🚧 Files skipped from review as they are similar to previous changes (4)
- silence/silence_test.go
- notify/mute_test.go
- silence/silence.go
- notify/notify.go
231742a to
364d6ab
Compare
@coderabbitai review

✅ Actions performed: Review triggered.
Actionable comments posted: 3
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
api/v2/api.go (1)
434-470: ⚠️ Potential issue | 🟠 Major — Use each group's marker when serializing `/alerts/groups`.

This handler rebuilds status into one request-scoped `tempMarker` keyed only by fingerprint, then reads that same value back for every group. If the same alert appears in multiple aggregation groups with different silencing/inhibition state, the last write wins and every group reports the same status. That loses the per-group marker semantics this refactor is introducing.

🤖 Prompt for AI Agents
tempMarkerkeyed only by fingerprint, then reads that same value back for every group. If the same alert appears in multiple aggregation groups with different silencing/inhibition state, the last write wins and every group reports the same status. That loses the per-group marker semantics this refactor is introducing.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@api/v2/api.go` around lines 434 - 470, The handler currently uses one request-scoped tempMarker for all groups so an alert present in multiple groups can have its status overwritten; instead, create a per-group marker and compute statuses per group before serializing. After alertGroups is returned, for each alertGroup allocate groupMarker := marker.NewAlertMarker(), run the inhibitor/silencer pipeline for that group's alerts (reuse api.alertFilter or call the underlying setAlertStatus logic with groupMarker) and then call groupMarker.Status(fp) when constructing apiAlert (replace tempMarker.Status(fp) with groupMarker.Status(fp)); keep references to api.alertFilter, alertGroups, tempMarker, and AlertToOpenAPIAlert to locate the spots to change.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@api/v2/api.go`:
- Around line 479-497: The closure returned by alertFilter reads
api.setAlertStatus without synchronization and can race with Update(); before
constructing the closure in alertFilter, take api.mtx (or api.mtx.RLock/RUnlock
as appropriate) and snapshot the current api.setAlertStatus into a local
variable (e.g., setAlertStatus := api.setAlertStatus), then release the lock and
use that local inside the returned func; reference alertFilter and
setAlertStatus so the closure uses the captured function pointer instead of
accessing api.setAlertStatus directly, ensuring safe concurrent reloads invoked
by Update(), getAlertsHandler, and getAlertGroupsHandler.
In `@dispatch/metric.go`:
- Line 47: The parameter registerLimitMetrics on NewDispatcherMetrics is unused;
either remove it from NewDispatcherMetrics' signature and all callers, or
implement conditional registration: if registerLimitMetrics is true then
register the aggrGroupLimitReached metric (and any related limit metrics) with
the provided prometheus.Registerer `r`, otherwise skip registering it; update
any tests/call sites accordingly so they match the new signature or behavior.
In `@inhibit/inhibit_test.go`:
- Around line 37-53: The helper checkMutes currently only asserts InhibitedBy is
empty when wantMuted is false, which can miss cases where the marker was never
written; update checkMutes (which uses marker.NewAlertMarker,
marker.WithContext, ih.Mutes and m.Status(fp)) to also assert that status.State
== alert.AlertStateActive on the non-muted path (and keep the existing checks
for alert.AlertStateSuppressed and non-empty InhibitedBy when wantMuted is true)
so the test verifies SetInhibited was called for non-muted alerts as well.
---
Outside diff comments:
In `@api/v2/api.go`:
- Around line 434-470: The handler currently uses one request-scoped tempMarker
for all groups so an alert present in multiple groups can have its status
overwritten; instead, create a per-group marker and compute statuses per group
before serializing. After alertGroups is returned, for each alertGroup allocate
groupMarker := marker.NewAlertMarker(), run the inhibitor/silencer pipeline for
that group's alerts (reuse api.alertFilter or call the underlying setAlertStatus
logic with groupMarker) and then call groupMarker.Status(fp) when constructing
apiAlert (replace tempMarker.Status(fp) with groupMarker.Status(fp)); keep
references to api.alertFilter, alertGroups, tempMarker, and AlertToOpenAPIAlert
to locate the spots to change.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro Plus
Run ID: 68f7566c-1035-48df-85ac-115fcc4985c7
📒 Files selected for processing (31)
- alert/state.go
- api/api.go
- api/v2/api.go
- api/v2/api_test.go
- api/v2/compat.go
- cmd/alertmanager/main.go
- dispatch/dispatch.go
- dispatch/dispatch_bench_test.go
- dispatch/dispatch_test.go
- dispatch/metric.go
- featurecontrol/featurecontrol.go
- inhibit/inhibit.go
- inhibit/inhibit_bench_test.go
- inhibit/inhibit_test.go
- marker/alert.go
- marker/alert_test.go
- marker/context.go
- marker/group.go
- marker/group_test.go
- marker/marker.go
- marker/status.go
- notify/mute.go
- notify/mute_test.go
- notify/notify.go
- provider/mem/mem.go
- provider/mem/mem_test.go
- silence/silence.go
- silence/silence_bench_test.go
- silence/silence_test.go
- types/types.go
- types/types_test.go
💤 Files with no reviewable changes (4)
- api/api.go
- types/types_test.go
- types/types.go
- provider/mem/mem.go
✅ Files skipped from review due to trivial changes (7)
- marker/status.go
- silence/silence_bench_test.go
- api/v2/api_test.go
- marker/marker.go
- featurecontrol/featurecontrol.go
- marker/group_test.go
- dispatch/dispatch_test.go
🚧 Files skipped from review as they are similar to previous changes (12)
- alert/state.go
- marker/alert.go
- cmd/alertmanager/main.go
- silence/silence_test.go
- api/v2/compat.go
- notify/mute_test.go
- inhibit/inhibit_bench_test.go
- marker/context.go
- dispatch/dispatch_bench_test.go
- inhibit/inhibit.go
- silence/silence.go
- dispatch/dispatch.go
@coderabbitai review

✅ Actions performed: Review triggered.
Actionable comments posted: 1
♻️ Duplicate comments (1)
inhibit/inhibit_test.go (1)
47-52: ⚠️ Potential issue | 🟡 Minor — Keep the empty `InhibitedBy` assertion on the active path.

Adding the `AlertStateActive` check is good, but dropping `require.Empty(t, status.InhibitedBy, ...)` weakens this helper again. The non-muted branch should still prove both that the marker was written and that no inhibition IDs were recorded.

Suggested tweak

```diff
 } else {
 	require.Equal(t, alert.AlertStateActive, status.State, msgAndArgs...)
+	require.Empty(t, status.InhibitedBy, msgAndArgs...)
 }
 }
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@inhibit/inhibit_test.go` around lines 47 - 52, The non-muted branch dropped the assertion that no inhibition IDs were recorded; when wantMuted is false you should also assert status.InhibitedBy is empty. Modify the else branch that checks require.Equal(t, alert.AlertStateActive, status.State, msgAndArgs...) to additionally call require.Empty(t, status.InhibitedBy, msgAndArgs...) so the helper verifies both the active state and that no inhibition IDs were written (referencing wantMuted, status.State, status.InhibitedBy, and AlertStateActive).
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@api/v2/api.go`:
- Around line 434-442: The code creates a single request-scoped tempMarker via
marker.NewAlertMarker() and passes it into api.alertFilter, which lets
setAlertStatus mutate that shared tempMarker causing status races across groups;
instead, change the logic so /alerts/groups does not reuse a single tempMarker
for the entire request: compute or fetch each group's actual marker/status
(e.g., call the group's marker lookup or instantiate a fresh marker per group)
and pass that per-group marker into setAlertStatus/api.alertFilter rather than
the global tempMarker, ensuring you stop using tempMarker.Status(fp) as a
request-wide cache; update the code paths around api.alertFilter,
setAlertStatus, and any callers that currently rely on the shared tempMarker
(also adjust the same pattern at the other noted spot around lines 469-470).
---
Duplicate comments:
In `@inhibit/inhibit_test.go`:
- Around line 47-52: The non-muted branch dropped the assertion that no
inhibition IDs were recorded; when wantMuted is false you should also assert
status.InhibitedBy is empty. Modify the else branch that checks require.Equal(t,
alert.AlertStateActive, status.State, msgAndArgs...) to additionally call
require.Empty(t, status.InhibitedBy, msgAndArgs...) so the helper verifies both
the active state and that no inhibition IDs were written (referencing wantMuted,
status.State, status.InhibitedBy, and AlertStateActive).
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro Plus
Run ID: 7fbf439a-752e-4283-977c-8c9f86a7d849
📒 Files selected for processing (2)
- api/v2/api.go
- inhibit/inhibit_test.go
| // Use a temporary marker to "predict" each alert's current status at | ||
| // query time (silenced, inhibited, active). The alertFilter calls | ||
| // setAlertStatus which runs the inhibitor/silencer pipeline and writes | ||
| // the result into tempMarker. We then use tempMarker for the response | ||
| // to preserve backward-compatible behavior that clients depend on. | ||
| // TODO: consider a feature flag or API query param to let callers opt | ||
| // into using the group's actual marker status instead of the prediction. | ||
| tempMarker := marker.NewAlertMarker() | ||
| af := api.alertFilter(ctx, matchers, *params.Silenced, *params.Inhibited, *params.Active, tempMarker) |
Don’t collapse /alerts/groups back onto a shared request marker.
Using one tempMarker for the whole groups request reintroduces the global-marker behavior this PR is trying to remove. The same alert fingerprint can exist in multiple aggregation groups/routes, so later setAlertStatus writes can overwrite earlier state and tempMarker.Status(fp) can report the wrong status for a given group. /alerts/groups needs the group’s actual marker/status, not a request-wide prediction cache.
Also applies to: 469-470
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@api/v2/api.go` around lines 434 - 442, The code creates a single
request-scoped tempMarker via marker.NewAlertMarker() and passes it into
api.alertFilter, which lets setAlertStatus mutate that shared tempMarker causing
status races across groups; instead, change the logic so /alerts/groups does not
reuse a single tempMarker for the entire request: compute or fetch each group's
actual marker/status (e.g., call the group's marker lookup or instantiate a
fresh marker per group) and pass that per-group marker into
setAlertStatus/api.alertFilter rather than the global tempMarker, ensuring you
stop using tempMarker.Status(fp) as a request-wide cache; update the code paths
around api.alertFilter, setAlertStatus, and any callers that currently rely on
the shared tempMarker (also adjust the same pattern at the other noted spot
around lines 469-470).
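The per-group fix asked for here boils down to allocating a fresh marker for each group, so the same fingerprint can carry different state in different groups instead of last-write-wins through one shared request marker. A toy sketch with hypothetical types (not the real `marker` package):

```go
package main

import "fmt"

// status holds the per-alert state a marker would track.
type status struct{ silencedBy []string }

// marker is a per-group status map, keyed by alert fingerprint.
type marker map[string]status

// group pairs a group key with the silences that apply within it.
type group struct {
	key      string
	alerts   []string
	silenced map[string][]string // fingerprint -> silence IDs in this group
}

// statusesPerGroup builds a fresh marker per group, so an alert silenced in
// one aggregation group can still report active in another.
func statusesPerGroup(groups []group) map[string]marker {
	out := make(map[string]marker)
	for _, g := range groups {
		m := make(marker) // fresh marker for this group only
		for _, fp := range g.alerts {
			m[fp] = status{silencedBy: g.silenced[fp]}
		}
		out[g.key] = m
	}
	return out
}

func main() {
	groups := []group{
		{key: "route-a", alerts: []string{"fp1"},
			silenced: map[string][]string{"fp1": {"sil-1"}}},
		{key: "route-b", alerts: []string{"fp1"}},
	}
	byGroup := statusesPerGroup(groups)
	// Same fingerprint, different state per group.
	fmt.Println(len(byGroup["route-a"]["fp1"].silencedBy),
		len(byGroup["route-b"]["fp1"].silencedBy))
}
```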
| c.emit(ch, float64(suppressed), labelValues...) | ||
| labelValues[0] = string(alert.AlertStateUnprocessed) | ||
| c.emit(ch, float64(unprocessed), labelValues...) | ||
| return true |
Here, do we want to exclude groups that have 0 in each state, or not?
I guess not, because these are gauge metrics and can drop to zero, so I assume having zero values in between changes is desired compared to missing metrics.
What do you think?
| // Set alert's current status based on its label set. | ||
| api.setAlertStatus(ctx, a.Labels) | ||
| // The inhibitor and silencer write to m via the context. | ||
| ctx = marker.WithContext(ctx, m) |
Do you want to create this context outside the function itself? This way we would avoid creating a new context each time the returned func is called.
No, we need to keep it as is since each call to the returned function is a new operation that we want to trace as a child span.
This specific call returns a derived context which points to the parent.
| status = &groupStatus{} | ||
| m.groups[key] = status | ||
| } | ||
| status.mutedBy = timeIntervalNames |
If timeIntervalNames is nil or empty, should we actually delete from the map here?
I am not sure, but this is just moving existing logic and is not new code.
| "pending", len(allIDs)-len(activeIDs), | ||
| ) | ||
| // TODO: remove this sort once the marker is removed. | ||
| sort.Strings(activeIDs) |
Are we sure we shouldn't still sort for the API to be consistent and the results comparable? Not necessarily here, but somewhere...
I am not sure, but does it matter if the IDs are in a specific order or do we guarantee this?
| status = &groupStatus{} | ||
| m.groups[key] = status | ||
| } | ||
| status.mutedBy = timeIntervalNames |
In other places you clone the slice, but not here... Any particular reason?
This is just moving existing logic and is not new code.
The PR is not focused on Group Markers used by time intervals so I would leave any changes related to those for later.
This change adds alert markers to the aggregation groups in dispatcher. Alert markers replace the global marker and are used to track the state of alerts in each aggregation group. This change touches many components of the alertmanager. Per-group alert markers are passed to the notifiers, and then to the inhibitor and silencer, using context.

The API has no breaking changes:
- /alerts uses a temporary marker to track the state of alerts
- /alerts/groups returns the group markers

Update metrics to use group markers. The `alertmanager_alerts` metric is moved to dispatcher. The `alertmanager_marked_alerts` metric is removed. By default it behaves the same as before, by aggregating all alerts in the groups. Enabling the `group-key-in-metrics` flag will cause the metrics to be grouped by `group_key`.

Signed-off-by: Siavash Safi <siavash@cloudflare.com>
a28f0b4 to
db964e5
Compare
This change adds alert markers to the aggregation groups in dispatcher.
Alert markers replace the global marker and are used to track the state of alerts in each aggregation group.
This change touches many components of the alertmanager.
Per Group alert markers are passed to the notifiers and then inhibitor and silencer using context.
The API has no breaking changes:

- /alerts uses a temporary marker to track the state of alerts
- /alerts/groups returns the group markers

The metrics are also updated to use group markers.

The `alertmanager_alerts` metric is moved to dispatcher. The `alertmanager_marked_alerts` metric is removed. By default it behaves the same as before, by aggregating all alerts in the groups.

Enabling the `group-key-in-metrics` flag will cause the metrics to be grouped by `group_key`.

Pull Request Checklist
Please check all the applicable boxes.
Which user-facing changes does this PR introduce?