
🧩 Runbook: <runbook title>

Labels: runbook, azure

Alert: <alert name>
Severity: <P1 / P2 / P3>
Source: <Prometheus / Azure Monitor / Loki / App Insights>
Jira Priority: <Blocker / Critical / Major / Minor / Trivial>
Runbook Path: /runbooks/<domain>/<filename>.md
Related Components: <affected service(s)>
Last Reviewed: <YYYY-MM-DD>


1️⃣ Purpose

Summarize in one or two sentences what the alert indicates, why it matters, and what this runbook covers.

Example:

This runbook explains how to diagnose and resolve high 5xx error rates on the Application Gateway.
Affected users may experience failed API calls or unavailable frontends.


2️⃣ Trigger Condition

Describe the alert condition or the PromQL/Azure Monitor query that fires it, including the threshold and evaluation duration where relevant.

Example:
Query: `rate(http_requests_total{status=~"5.."}[5m]) > 0.01`
Threshold: >1% 5xx responses over 5 minutes
Alert Severity: P1
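
The threshold logic can be sketched locally; the request counts below are hypothetical stand-ins for what the PromQL query would return in production:

```shell
#!/bin/sh
# Hypothetical 5-minute window counts; in production these come from the
# rate(http_requests_total{status=~"5.."}[5m]) query, not shell variables.
total_requests=1000
error_responses=15

# Compute the 5xx ratio and compare it against the 1% threshold.
rate=$(awk -v e="$error_responses" -v t="$total_requests" 'BEGIN { printf "%.4f", e / t }')

if awk -v r="$rate" 'BEGIN { exit !(r > 0.01) }'; then
  echo "ALERT: 5xx rate ${rate} exceeds 1% threshold"
else
  echo "OK: 5xx rate ${rate} within threshold"
fi
```

With these sample numbers the ratio is 1.5%, so the ALERT branch fires, matching the alert condition above.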

3️⃣ Immediate Actions

Step-by-step actions to triage and contain the issue.

  1. Verify that the alert is valid (not a false positive).
  2. Check affected services or metrics dashboards.
  3. Review related logs in Loki or Azure Monitor.
  4. Apply mitigation (e.g., restart, scale-up, disable feature toggle).
  5. Escalate if the alert persists beyond the defined SLA.

Include shell or SQL snippets if relevant:

```sh
kubectl get pods -n <namespace> | grep CrashLoopBackOff
```
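
As a concrete illustration of that filter, grep isolates failing pods from `kubectl get pods` output; the pod names and counts below are hypothetical samples, not real cluster state:

```shell
#!/bin/sh
# Sample text in the shape `kubectl get pods -n <namespace>` prints
# (hypothetical pods), used here so the filter can be shown end to end.
pods='NAME        READY   STATUS             RESTARTS   AGE
api-7f9c    0/1     CrashLoopBackOff   12         3h
web-5d2a    1/1     Running            0          3h'

# Keep only the pods stuck in CrashLoopBackOff, as the command above does.
echo "$pods" | grep CrashLoopBackOff
```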

4️⃣ Root Cause Investigation

Explain what typically causes this alert and how to confirm it.

| Possible Cause | Diagnostic Command | Expected Output |
| --- | --- | --- |
| Configuration error | `kubectl describe pod <pod>` | Misconfigured env var |
| Resource exhaustion | `kubectl top pod` | CPU > 90% |
| Storage full | `df -h` on node | Disk usage > 95% |
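
The "Storage full" check in the table can be scripted; the `df -h` output line below is a hypothetical sample so the parsing is self-contained:

```shell
#!/bin/sh
# One line in the shape `df -h` prints for a filesystem (hypothetical values).
df_line='/dev/sda1  100G  96G  4.0G  96% /var/lib'

# Field 5 is the usage percentage; strip the % sign before comparing.
usage=$(echo "$df_line" | awk '{ gsub(/%/, "", $5); print $5 }')

if [ "$usage" -gt 95 ]; then
  echo "Disk usage critical: ${usage}%"
else
  echo "Disk usage OK: ${usage}%"
fi
```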

5️⃣ Remediation / Resolution Steps

List actions to permanently fix the issue.

  • Reconfigure or redeploy service
  • Increase capacity or enable autoscaling
  • Update dependencies or patch version
  • Apply missing IAM permissions

Provide commands or references where applicable.


6️⃣ Escalation Path

| Level | Role / Team | Contact |
| --- | --- | --- |
| Primary | Platform Engineering | #sdk-ops |
| Secondary | Service Owner | <name or contact> |
| Escalate to | Product Owner / SRE Lead | <contact> |

7️⃣ Verification

Steps to confirm the issue is resolved.

  1. Re-run dashboard query or alert test.
  2. Ensure metrics return to normal baseline.
  3. Close the Jira ticket after 24 h of stability.
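
Step 2 can be sketched as a baseline comparison; the baseline and current readings below are hypothetical values in place of a live dashboard query:

```shell
#!/bin/sh
# Hypothetical readings: the pre-incident baseline error rate and the
# current reading after mitigation. A 10% tolerance band is assumed here.
baseline=0.002
current=0.0015

if awk -v c="$current" -v b="$baseline" 'BEGIN { exit !(c <= b * 1.1) }'; then
  echo "Metric back at baseline; start the 24 h stability window"
else
  echo "Still elevated; keep the incident open"
fi
```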

8️⃣ References


9️⃣ Post-Incident Review

| Field | Description |
| --- | --- |
| Incident Date | YYYY-MM-DD |
| Root Cause Summary | |
| Time to Detect (MTTD) | |
| Time to Repair (MTTR) | |
| Preventive Action | |
| Owner | |

Responsible Team: Platform Engineering / Service Owner