Skip to content

feat(alert): default/disable semantics for consecutive-failure & exhausted-retries alerts#964

Open
alexluong wants to merge 6 commits into
mainfrom
feat/alert-default-behavior
Open

feat(alert): default/disable semantics for consecutive-failure & exhausted-retries alerts#964
alexluong wants to merge 6 commits into
mainfrom
feat/alert-default-behavior

Conversation

@alexluong

@alexluong alexluong commented Jun 19, 2026

Copy link
Copy Markdown
Collaborator

What

Give each alert dimension a single env/YAML key that controls both its value and whether it runs, with consistent unset/empty/value semantics:

Config value Behavior
unset (absent) use the built-in default
empty string "" disable that alert dimension
a value use that value

Applies to both:

  • ALERT_CONSECUTIVE_FAILURE_COUNT (default 100)
  • ALERT_EXHAUSTED_RETRIES_WINDOW_SECONDS (default 3600)

ALERT_AUTO_DISABLE_DESTINATION stays an independent bool. When the consecutive-failure dimension is disabled, auto-disable simply never triggers (it's a no-op, not an error).

Why

Previously there was no way to turn an alert dimension off via config — only to change its threshold. Operators want one knob per dimension: leave it alone for the default, set it to disable, or set a value. No separate _enabled flag.

Note: custom OptionalString type (d9f5f7b)

The unset/empty/value rule needs to tell "var present but empty" apart from "var absent" on the env surface. A plain *string doesn't work: the env library (caarlos0/env) treats a present-empty var as unset, and panics on a non-nil *string. OptionalString implements TextUnmarshaler so it binds as a scalar, and env-empty is captured explicitly via LookupEnv. Calling it out since it's a non-obvious type that exists purely to work around that library behavior.

alexluong and others added 6 commits June 19, 2026 13:54
…erts

Introduce alert.Settings (the resolved, operational alert config) plus two
monitor gates: WithConsecutiveFailureEnabled and WithExhaustedRetriesEnabled.

Both default to true, so behavior is unchanged until a caller opts out. When
consecutive-failure alerting is gated off the monitor neither tracks failures
nor auto-disables; when exhausted-retries is gated off it never emits, even
with retries enabled. Extracts the consecutive-failure path into a helper to
keep the replay/ordering semantics identical on the enabled path.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…/value rule

AlertConfig.ConsecutiveFailureCount and ExhaustedRetriesWindowSeconds become
*string so the parse layer can tell three states apart: unset uses the default
(100 / 3600), an empty string disables that alert dimension, and any other value
must parse to a non-negative integer.

AlertConfig.ToConfig resolves the raw values into the operational alert.Settings
(domain-owned, so nothing downstream imports config). Validate rejects malformed
values at startup. builder wires the resolved gates into the monitor and only
builds the exhausted-retries suppression window when enabled with a positive
window (0 = alert on every exhaustion).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…ence

Update the Alerts section of the self-hosting config reference for the new
unset/empty/value rule: ALERT_CONSECUTIVE_FAILURE_COUNT defaults to 100 (empty
disables), and document the previously-undocumented
ALERT_EXHAUSTED_RETRIES_WINDOW_SECONDS (default 3600, empty disables, 0 = no
suppression).

Also correct stale entries: drop the removed ALERT_CALLBACK_URL, fix the
ALERT_AUTO_DISABLE_DESTINATION default (false, not true), and fix the YAML
example key (alert, not alerts).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Document the unset/empty/value behavior for ALERT_CONSECUTIVE_FAILURE_COUNT,
ALERT_EXHAUSTED_RETRIES_WINDOW_SECONDS and ALERT_AUTO_DISABLE_DESTINATION in the
ManagedConfig schema. Descriptions only — the properties are already typed as
string. SDKs are regenerated from this schema at release time.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The *string representation had two problems found by manual QA:
1. caarlos0/env ignores a present-but-empty env var, so `ALERT_..._COUNT=`
   resolved to the default instead of disabling — the empty=off rule only
   worked via YAML, not env vars (the primary surface for the cloud product).
2. caarlos0/env crashes ("expected a pointer to a Struct") on any non-nil
   *string it walks, so setting these in a YAML config file would crash startup
   (env.Parse runs after the YAML load).

Replace *string with an OptionalString value type that implements both
TextUnmarshaler (bound by caarlos0/env as a scalar — no crash) and
yaml.Unmarshaler (so `key: ""` expresses the empty/off state). The one case
caarlos0/env cannot surface — a present-but-empty env var — is handled
explicitly via OSInterface.LookupEnv, which also gives env precedence over YAML.

Net: unset -> default, empty -> disabled, value -> value, identically on both
env and YAML, with env > yaml. Adds a full parse-path test covering the matrix
and precedence.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…n test

Missed call site when migrating AlertConfig fields to OptionalString;
the raw int assignment broke the cmd/e2e build.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants