feat(alert): default/disable semantics for consecutive-failure & exhausted-retries alerts#964
Open
alexluong wants to merge 6 commits into
Open
feat(alert): default/disable semantics for consecutive-failure & exhausted-retries alerts#964alexluong wants to merge 6 commits into
alexluong wants to merge 6 commits into
Conversation
…erts Introduce alert.Settings (the resolved, operational alert config) plus two monitor gates: WithConsecutiveFailureEnabled and WithExhaustedRetriesEnabled. Both default to true, so behavior is unchanged until a caller opts out. When consecutive-failure alerting is gated off the monitor neither tracks failures nor auto-disables; when exhausted-retries is gated off it never emits, even with retries enabled. Extracts the consecutive-failure path into a helper to keep the replay/ordering semantics identical on the enabled path. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…/value rule AlertConfig.ConsecutiveFailureCount and ExhaustedRetriesWindowSeconds become *string so the parse layer can tell three states apart: unset uses the default (100 / 3600), an empty string disables that alert dimension, and any other value must parse to a non-negative integer. AlertConfig.ToConfig resolves the raw values into the operational alert.Settings (domain-owned, so nothing downstream imports config). Validate rejects malformed values at startup. builder wires the resolved gates into the monitor and only builds the exhausted-retries suppression window when enabled with a positive window (0 = alert on every exhaustion). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…ence Update the Alerts section of the self-hosting config reference for the new unset/empty/value rule: ALERT_CONSECUTIVE_FAILURE_COUNT defaults to 100 (empty disables), and document the previously-undocumented ALERT_EXHAUSTED_RETRIES_WINDOW_SECONDS (default 3600, empty disables, 0 = no suppression). Also correct stale entries: drop the removed ALERT_CALLBACK_URL, fix the ALERT_AUTO_DISABLE_DESTINATION default (false, not true), and fix the YAML example key (alert, not alerts). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Document the unset/empty/value behavior for ALERT_CONSECUTIVE_FAILURE_COUNT, ALERT_EXHAUSTED_RETRIES_WINDOW_SECONDS and ALERT_AUTO_DISABLE_DESTINATION in the ManagedConfig schema. Descriptions only — the properties are already typed as string. SDKs are regenerated from this schema at release time. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The *string representation had two problems found by manual QA:
1. caarlos0/env ignores a present-but-empty env var, so `ALERT_..._COUNT=`
resolved to the default instead of disabling — the empty=off rule only
worked via YAML, not env vars (the primary surface for the cloud product).
2. caarlos0/env crashes ("expected a pointer to a Struct") on any non-nil
*string it walks, so setting these in a YAML config file would crash startup
(env.Parse runs after the YAML load).
Replace *string with an OptionalString value type that implements both
TextUnmarshaler (bound by caarlos0/env as a scalar — no crash) and
yaml.Unmarshaler (so `key: ""` expresses the empty/off state). The one case
caarlos0/env cannot surface — a present-but-empty env var — is handled
explicitly via OSInterface.LookupEnv, which also gives env precedence over YAML.
Net: unset -> default, empty -> disabled, value -> value, identically on both
env and YAML, with env > yaml. Adds a full parse-path test covering the matrix
and precedence.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…n test Missed call site when migrating AlertConfig fields to OptionalString; the raw int assignment broke the cmd/e2e build. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
alexbouchardd
approved these changes
Jun 19, 2026
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
What
Give each alert dimension a single env/YAML key that controls both its value and whether it runs, with consistent unset/empty/value semantics:
""Applies to both:
ALERT_CONSECUTIVE_FAILURE_COUNT(default100)ALERT_EXHAUSTED_RETRIES_WINDOW_SECONDS(default3600)ALERT_AUTO_DISABLE_DESTINATIONstays an independent bool. When the consecutive-failure dimension is disabled, auto-disable simply never triggers (it's a no-op, not an error).Why
Previously there was no way to turn an alert dimension off via config — only to change its threshold. Operators want one knob per dimension: leave it alone for the default, set it to disable, or set a value. No separate
_enabledflag.Note: custom
OptionalStringtype (d9f5f7b)The unset/empty/value rule needs to tell "var present but empty" apart from "var absent" on the env surface. A plain
*stringdoesn't work: the env library (caarlos0/env) treats a present-empty var as unset, and panics on a non-nil*string.OptionalStringimplementsTextUnmarshalerso it binds as a scalar, and env-empty is captured explicitly viaLookupEnv. Calling it out since it's a non-obvious type that exists purely to work around that library behavior.