🧩 Runbook:
Alert:
Severity: <P1 / P2 / P3>
Source: <Prometheus / Azure Monitor / Loki / App Insights>
Jira Priority: <Blocker / Critical / Major / Minor / Trivial>
Runbook Path: /runbooks/<domain>/<filename>.md
Related Components: <affected service(s)>
Last Reviewed:
1️⃣ Purpose
In one or two sentences, summarize what this runbook covers, what the alert indicates, and why it matters.
Example:
This runbook explains how to diagnose and resolve high 5xx error rates on the Application Gateway.
Affected users may experience failed API calls or unavailable frontends.
2️⃣ Trigger Condition
Describe the alert condition or PromQL/Azure Monitor query.
Include threshold and duration if relevant.
Example:
Query: rate(http_requests_total{status=~"5.."}[5m]) > 0.01
Threshold: >1% 5xx responses over 5 minutes
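The threshold above can be sanity-checked with a small shell sketch; the request counts below are illustrative stand-ins for the values the Prometheus query returns:

```shell
# Compute the 5xx error ratio over the window (illustrative values;
# in practice these come from the Prometheus query above).
total_requests=12000
error_responses=150   # 5xx count over the same 5-minute window

ratio=$(awk -v e="$error_responses" -v t="$total_requests" 'BEGIN { printf "%.4f", e / t }')
echo "5xx ratio: $ratio"

# Fire the alert when the ratio exceeds the 1% threshold.
if awk -v r="$ratio" 'BEGIN { exit !(r > 0.01) }'; then
  echo "ALERT: 5xx error rate above 1% threshold"
fi
```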
Alert Severity: P1
3️�⃣ Immediate Actions
Step-by-step actions to triage and contain the issue.
- Verify that the alert is valid (not a false positive).
- Check affected services or metrics dashboards.
- Review related logs in Loki or Azure Monitor.
- Apply mitigation (e.g., restart, scale-up, disable feature toggle).
- Escalate if the alert persists beyond the defined SLA.
Include shell or SQL snippets if relevant:
kubectl get pods -n <namespace> | grep CrashLoopBackOff
4️⃣ Root Cause Investigation
Explain what typically causes this alert and how to confirm it.
| Possible Cause | Diagnostic Command | Expected Output |
|---|---|---|
| Configuration error | kubectl describe pod <pod> | Misconfigured env var |
| Resource exhaustion | kubectl top pod | CPU > 90% |
| Storage full | df -h on node | Disk usage > 95% |
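The "storage full" check in the table above can be scripted; this is a minimal sketch in which the 95% threshold mirrors the table entry:

```shell
# Flag any filesystem above the usage threshold from the table.
# `df -P` keeps the output in stable POSIX columns for parsing.
check_disk_usage() {
  # Reads `df -P`-style output on stdin; prints filesystems over $1 percent.
  awk -v t="$1" 'NR > 1 { gsub("%", "", $5); if ($5 + 0 > t) print $1, $5 "%" }'
}

df -P | check_disk_usage 95
```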
5️⃣ Remediation / Resolution Steps
List actions to permanently fix the issue.
- Reconfigure or redeploy service
- Increase capacity or enable autoscaling
- Update dependencies or patch version
- Apply missing IAM permissions
Provide commands or references where applicable.
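As one concrete reference for the autoscaling item above, a minimal HorizontalPodAutoscaler manifest might look like this (all names and limits are placeholders, not values from this environment):

```yaml
# Example HPA enabling autoscaling for the affected service
# (names, replica counts, and targets are placeholders).
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: <service>-hpa
  namespace: <namespace>
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: <service>
  minReplicas: 2
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 80
```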
6️⃣ Escalation Path
| Level | Role / Team | Contact |
|---|---|---|
| Primary | Platform Engineering | #sdk-ops |
| Secondary | Service Owner | <name or contact> |
| Escalate to | Product Owner / SRE Lead | <contact> |
7️⃣ Verification
Steps to confirm the issue is resolved.
- Re-run dashboard query or alert test.
- Ensure metrics return to normal baseline.
- Close the Jira ticket after 24 hours of stability.
8️⃣ References
- Related Runbooks
- Grafana Dashboard:
- Azure Documentation:
- [Internal Wiki: ]
9️⃣ Post-Incident Review
| Field | Description |
|---|---|
| Incident Date | YYYY-MM-DD |
| Root Cause Summary | |
| Time to Detect (MTTD) | |
| Time to Repair (MTTR) | |
| Preventive Action | |
| Owner | |
Responsible Team: Platform Engineering / Service Owner