Alerting Runbooks - How to create, maintain and use them

This repository stores operational runbooks, including alerting runbooks used by on‑call engineers and SREs to triage and remediate incidents quickly and
consistently.

Goals

  • Standardize responses to common alerts.
  • Reduce time‑to‑mitigation and error rate during incidents.
  • Keep operational knowledge versioned, reviewable, and discoverable.

Repository layout

  • README - You are here.
  • template.md - Generic runbook template you can copy to start a new runbook.
  • azure/ - Cloud/provider‑specific runbooks (example: azure/postgres-storage.md).

Create a new alerting runbook

Start from the template

  • Copy template.md to a new file with a descriptive name, e.g. database-disk-near-full.md or alert-api-5xx-spike.md.
  • Keep file names lowercase, words separated by hyphens.
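The naming rule can be checked automatically, for example in a pre-commit hook. The following is a minimal sketch, not an existing tool in this repository: the function name `is_valid_runbook_name` is hypothetical, and the pattern assumes names consist of lowercase ASCII letters and digits in hyphen-separated words with a `.md` extension.

```python
import re

# Assumption: lowercase ASCII letters/digits, words joined by single hyphens,
# ".md" extension. The README itself only states "lowercase, hyphens".
RUNBOOK_NAME = re.compile(r"[a-z0-9]+(?:-[a-z0-9]+)*\.md")


def is_valid_runbook_name(filename: str) -> bool:
    """Return True if the file name follows the naming convention."""
    return RUNBOOK_NAME.fullmatch(filename) is not None


if __name__ == "__main__":
    for name in ("database-disk-near-full.md", "API_5xx_Spike.md"):
        print(name, is_valid_runbook_name(name))
```

Under these assumptions, `database-disk-near-full.md` passes and `API_5xx_Spike.md` is rejected because of the uppercase letters and underscores.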

Fill out the essentials first

  • Purpose: One sentence that describes the alert and its impact.
  • Trigger Condition: Exact thresholds or conditions from your alerting tool.
  • Immediate Actions (first 5 minutes): Concrete, copy‑pastable steps.
  • Escalation Path: Who to page or what team/channel to contact.
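A quick way to confirm that a draft covers the four essentials is to scan it for the section names. This is a sketch only: `missing_sections` is a hypothetical helper, and it assumes each section name appears at the start of a line, optionally preceded by a heading or bullet marker.

```python
REQUIRED_SECTIONS = (
    "Purpose",
    "Trigger Condition",
    "Immediate Actions",
    "Escalation Path",
)


def missing_sections(runbook_text: str) -> list[str]:
    """Return the required section names that no line of the text starts with."""
    # Assumption: strip common heading/bullet markers before comparing.
    starts = [
        line.lstrip("#*-• \t").lower() for line in runbook_text.splitlines()
    ]
    return [
        section
        for section in REQUIRED_SECTIONS
        if not any(start.startswith(section.lower()) for start in starts)
    ]


if __name__ == "__main__":
    draft = "# Purpose\nDisk nearly full.\n\n## Trigger Condition\nusage > 90%\n"
    print(missing_sections(draft))
```

For the draft above, the function reports the two sections that are still missing: Immediate Actions and Escalation Path.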

Add deep‑dive sections as you learn

  • Diagnosis checklist, dashboards to open, common root causes, and command snippets.
  • Reversible mitigation steps with rollback instructions and clear blast‑radius notes.
  • Add screenshots, diagrams, or exported alert JSON to this repository and link to them from the runbook using relative paths.
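Relative links to screenshots and exported JSON break easily when files are moved. The sketch below lists relative link targets that do not exist on disk. It assumes runbooks use Markdown link syntax (`[text](target)`); the function name `broken_relative_links` is hypothetical, and the regular expression is a simplification that ignores link titles and reference-style links.

```python
import re
from pathlib import Path

# Simplified Markdown link pattern: captures the target of [text](target).
LINK = re.compile(r"\[[^\]]*\]\(([^)\s]+)\)")
# Anything starting with a URL scheme (https:, mailto:, ...) is not relative.
SCHEME = re.compile(r"[a-z][a-z0-9+.-]*:", re.IGNORECASE)


def broken_relative_links(runbook_path: Path) -> list[str]:
    """Return relative link targets in the runbook that do not exist on disk."""
    text = runbook_path.read_text(encoding="utf-8")
    broken = []
    for target in LINK.findall(text):
        if SCHEME.match(target) or target.startswith("#"):
            continue  # absolute URL or in-page anchor: not checked here
        path_part = target.split("#", 1)[0]  # drop any trailing anchor
        if not (runbook_path.parent / path_part).exists():
            broken.append(target)
    return broken
```

Targets are resolved against the directory that contains the runbook, which matches how relative paths behave when the file is rendered in the repository.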

Validate and test

  • Run a tabletop exercise: follow the runbook step‑by‑step to ensure it’s actionable and unambiguous.
  • Ask a peer who is unfamiliar with the system to test the runbook.

Get a review

  • Open a PR and request review from on‑call/SRE owners for the related service.
  • Use clear commit messages (e.g., “Add runbook for API 5xx spike alert”).

Keeping runbooks healthy

  • Versioning: Use small, frequent PRs. Prefer appending notes over deleting context.
  • Ownership: Each runbook should list an owning team and contact method.
  • Freshness: Add a “Last Reviewed” field and update it after incidents or quarterly reviews.
  • Observability links: Keep links to dashboards, logs, and alerts up‑to‑date.
  • Safety: Mark destructive commands and include rollbacks. Favor read‑only commands by default.
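The freshness rule lends itself to a periodic check. The following sketch flags runbooks whose review date is missing or too old. The field format `Last Reviewed: YYYY-MM-DD` and the 90-day threshold (roughly one quarter) are assumptions, not conventions this repository defines, and the function names are hypothetical.

```python
import re
from datetime import date, timedelta
from typing import Optional

# Assumption: the field is written as "Last Reviewed: YYYY-MM-DD".
LAST_REVIEWED = re.compile(r"Last Reviewed:\s*(\d{4}-\d{2}-\d{2})")


def last_reviewed(runbook_text: str) -> Optional[date]:
    """Return the review date found in the text, or None if the field is absent."""
    match = LAST_REVIEWED.search(runbook_text)
    return date.fromisoformat(match.group(1)) if match else None


def is_stale(runbook_text: str, today: date, max_age_days: int = 90) -> bool:
    """Treat a runbook as stale if it has no review date or the date is too old."""
    reviewed = last_reviewed(runbook_text)
    return reviewed is None or today - reviewed > timedelta(days=max_age_days)


if __name__ == "__main__":
    text = "Owner: DB team\nLast Reviewed: 2024-01-01\n"
    print(is_stale(text, today=date(2024, 6, 1)))
```

Passing `today` explicitly keeps the check deterministic in tests; a scheduled job could call it with `date.today()`.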

Style guidelines

  • Be concise. Use numbered steps and checklists.
  • Put commands in code blocks and include expected outputs when helpful.
  • Provide time‑boxed actions (e.g., “If disk usage does not drop below 85% within 10 minutes, escalate to DB team”).

Contributing checklist

  • Template sections completed (at minimum: Purpose, Trigger Condition, Immediate Actions, Escalation Path)
  • Links to dashboards/logs verified
  • Rollback steps provided for each mitigation
  • Peer review requested
  • “Last Reviewed” date updated