
P2: Operations runbook — SLOs, capacity, incidents, failover (#180) #197

Open
dkijania wants to merge 1 commit into main from docs/runbook


Conversation

@dkijania

Contributor

What & why

Part of the production-readiness epic (#163). Closes #180.

There was one benchmark data point but no runbook, SLOs, or documented failure-mode response.

Adds docs/runbook.md

  • SLOs — starting targets for availability, p50/p99 latency, error rate.
  • What to watch — the /metrics RED signals, readiness, structured logs, and suggested alert thresholds.
  • Scaling & capacity — stateless horizontal scaling, Postgres as the real ceiling, and replicas × PG_MAX_CONNECTIONS vs DB max_connections math.
  • Multi-host Postgres failover — documented semantics and recovery expectations (with an honest "validate for your topology" caveat).
  • Common incidents — a symptom → cause → action table.
  • Deploys & rollback — tied to graceful shutdown + readiness gating.
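
The replicas × PG_MAX_CONNECTIONS versus DB max_connections check from the scaling bullet comes down to simple arithmetic. The sketch below is an illustration only, not taken from docs/runbook.md: the function names, the `reserved` parameter (connections held back for admin sessions, migrations, or monitoring), and the example numbers are all assumptions.

```python
def connection_headroom(
    replicas: int,
    pool_size_per_replica: int,
    db_max_connections: int,
    reserved: int = 0,
) -> int:
    """Connections left on the database if every replica fills its pool.

    A negative result means the fleet can ask for more connections than
    Postgres will accept, so some requests would fail to get a connection.
    """
    return db_max_connections - reserved - replicas * pool_size_per_replica


def max_safe_replicas(
    pool_size_per_replica: int,
    db_max_connections: int,
    reserved: int = 0,
) -> int:
    """Largest replica count whose combined pools fit under the DB limit."""
    if pool_size_per_replica <= 0:
        raise ValueError("pool_size_per_replica must be positive")
    usable = db_max_connections - reserved
    return max(usable // pool_size_per_replica, 0)


if __name__ == "__main__":
    # Hypothetical numbers: PG_MAX_CONNECTIONS=10 per replica,
    # max_connections=100 on the database, 10 connections held back.
    print(max_safe_replicas(10, 100, reserved=10))
    print(connection_headroom(8, 10, 100, reserved=10))
    print(connection_headroom(10, 10, 100, reserved=10))
```

With these assumed numbers, eight replicas leave spare connections while ten would oversubscribe the database. That is the reason to recheck the pool size whenever the replica count changes.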

Linked from the README. References observability/config features delivered by the sibling PRs. An automated replica-failover test is noted as a follow-up (needs a multi-host DB harness), so the doc is the deliverable here.

Testing

Docs only. prettier --debug-check . clean. No application code changed.

🤖 Generated with Claude Code

There was one benchmark data point but no runbook, SLOs, or documented
failure-mode response.

Add docs/runbook.md:
- Starting SLOs (availability, p50/p99 latency, error rate).
- What to watch (the /metrics RED signals, readiness, structured logs) and
  suggested alerts.
- Scaling & capacity guidance — stateless horizontal scaling, Postgres as the
  real ceiling, pool-vs-max_connections math.
- Multi-host Postgres failover semantics and recovery expectations.
- A common-incidents table mapping symptoms to causes and actions.
- Deploy/rollback notes tied to graceful shutdown and readiness gating.

Linked from the README. An automated replica-failover test is noted as a
follow-up (needs a multi-host DB harness).

Closes #180.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01QSuak9smCHbp4N17xjjLF6
@dkijania added the documentation, production-readiness, and P2 labels on Jun 29, 2026

Labels

documentation: Improvements or additions to documentation
P2: GA polish / hygiene
production-readiness: Work toward making the API production-ready / publicly available


Development

Successfully merging this pull request may close these issues.

P2: Runbook / SLOs / capacity + replica-failover semantics

1 participant