Alerting Runbooks - How to create, maintain and use them
This repository stores operational runbooks, including alerting runbooks used by on‑call engineers and SREs to triage and remediate incidents quickly and
consistently.
Goals
- Standardize responses to common alerts.
- Reduce time‑to‑mitigation and error rate during incidents.
- Keep operational knowledge versioned, reviewable, and discoverable.
Repository layout
- README - You are here.
- template - Generic runbook template you can copy to start a new runbook.
- /azure/ - Cloud/provider‑specific runbooks (example: postgres-storage.md).
Create a new alerting runbook
Start from the template
- Copy template.md to a new file with a descriptive name, e.g. database-disk-near-full.md or alert-api-5xx-spike.md.
- Keep file names lowercase, words separated by hyphens.
Fill out the essentials first
- Purpose: One sentence that describes the alert and its impact.
- Trigger Condition: Exact thresholds or conditions from your alerting tool.
- Immediate Actions (first 5 minutes): Concrete, copy‑pastable steps.
- Escalation Path: Who to page or what team/channel to contact.
Add deep‑dive sections as you learn
- Diagnosis checklist, dashboards to open, common root causes, and command snippets.
- Reversible mitigation steps with rollback instructions and clear blast‑radius notes.
Link supporting material
- Add screenshots, diagrams, or exported alert JSON to this repository and link to them from the runbook using relative paths.
Validate and test
- Do a tabletop: follow the runbook step‑by‑step to ensure it’s actionable and unambiguous.
- Ask a peer to test the runbook who is unfamiliar with the system.
Get a review
- Open a PR and request review from on‑call/SRE owners for the related service.
- Use clear commit messages (e.g., “Add runbook for API 5xx spike alert”).
Keeping runbooks healthy
- Versioning: Use small, frequent PRs. Prefer appending notes over deleting context.
- Ownership: Each runbook should list an owning team and contact method.
- Freshness: Add a “Last Reviewed” field and update it after incidents or quarterly reviews.
- Observability links: Keep links to dashboards, logs, and alerts up‑to‑date.
- Safety: Mark destructive commands and include rollbacks. Favor read‑only commands by default.
Style guidelines
- Be concise. Use numbered steps and checklists.
- Put commands in code blocks and include expected outputs when helpful.
- Provide time‑boxed actions (e.g., “If disk usage does not drop below 85% within 10 minutes, escalate to DB team”).
Contributing checklist
- Template sections completed (at minimum: Purpose, Trigger Condition, Immediate Actions, Escalation Path)
- Links to dashboards/logs verified
- Rollback steps provided for each mitigation
- Peer review requested
- “Last Reviewed” date updated
Related examples in this repo
- azure/postgres-storage.md - example of a provider‑specific runbook.
- template.md - copy this file to start a new runbook.