🧩 Runbook:
Alert:
Severity: <P1 / P2 / P3>
Source: <Prometheus / Azure Monitor / Loki / App Insights>
Jira Priority: <Blocker / Critical / Major / Minor / Trivial>
Runbook Path: /runbooks/<domain>/<filename>.md
Related Components: <affected service(s)>
Last Reviewed:
1️⃣ Purpose
In one or two sentences, summarize what this runbook covers, what the alert indicates, and why it matters.
Example:
This runbook explains how to diagnose and resolve high 5xx error rates on the Application Gateway.
Affected users may experience failed API calls or unavailable frontends.
2️⃣ Trigger Condition
Describe the alert condition or PromQL/Azure Monitor query.
Include threshold and duration if relevant.
Example:
Query: rate(http_requests_total{status=~"5.."}[5m]) > 0.01
Threshold: >1% 5xx responses over 5 minutes
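The threshold above can be sanity-checked with a small shell sketch; the request counts below are illustrative stand-ins for the values the Prometheus query returns:

```shell
# Compute the 5xx error ratio over the window (illustrative values;
# in practice these come from the Prometheus query above).
total_requests=12000
error_responses=150   # 5xx count over the same 5-minute window

ratio=$(awk -v e="$error_responses" -v t="$total_requests" 'BEGIN { printf "%.4f", e / t }')
echo "5xx ratio: $ratio"

# Fire the alert when the ratio exceeds the 1% threshold.
if awk -v r="$ratio" 'BEGIN { exit !(r > 0.01) }'; then
  echo "ALERT: 5xx error rate above 1% threshold"
fi
```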
Alert Severity: P1
3️�⃣ Immediate Actions
Step-by-step actions to triage and contain the issue.
- Verify that the alert is valid (not a false positive).
- Check affected services or metrics dashboards.
- Review related logs in Loki or Azure Monitor.
- Apply mitigation (e.g., restart, scale-up, disable feature toggle).
- Escalate if the alert persists beyond the defined SLA.
Include shell or SQL snippets if relevant:
kubectl get pods -n <namespace> | grep CrashLoopBackOff
4️⃣ Root Cause Investigation
Explain what typically causes this alert and how to confirm it.
| Possible Cause | Diagnostic Command | Expected Output |
|---|---|---|
| Configuration error | kubectl describe pod <pod> | Misconfigured env var |
| Resource exhaustion | kubectl top pod | CPU > 90% |
| Storage full | df -h on node | Disk usage > 95% |
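The "storage full" check in the table above can be scripted; this is a minimal sketch in which the 95% threshold mirrors the table entry:

```shell
# Flag any filesystem above the usage threshold from the table.
# `df -P` keeps the output in stable POSIX columns for parsing.
check_disk_usage() {
  # Reads `df -P`-style output on stdin; prints filesystems over $1 percent.
  awk -v t="$1" 'NR > 1 { gsub("%", "", $5); if ($5 + 0 > t) print $1, $5 "%" }'
}

df -P | check_disk_usage 95
```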
5️⃣ Remediation / Resolution Steps
List actions to permanently fix the issue.
- Reconfigure or redeploy service
- Increase capacity or enable autoscaling
- Update dependencies or patch version
- Apply missing IAM permissions
Provide commands or references where applicable.
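As one concrete reference for the autoscaling item above, a minimal HorizontalPodAutoscaler manifest might look like this (all names and limits are placeholders, not values from this environment):

```yaml
# Example HPA enabling autoscaling for the affected service
# (names, replica counts, and targets are placeholders).
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: <service>-hpa
  namespace: <namespace>
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: <service>
  minReplicas: 2
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 80
```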
6️⃣ Escalation Path
| Level | Role / Team | Contact |
|---|---|---|
| Primary | Platform Engineering | #sdk-ops |
| Secondary | Service Owner | <name or contact> |
| Escalate to | Product Owner / SRE Lead | <contact> |
7️⃣ Verification
Steps to confirm the issue is resolved.
- Re-run dashboard query or alert test.
- Ensure metrics return to normal baseline.
- Close the Jira ticket after 24 hours of stability.
8️⃣ References
- Related Runbooks
- Grafana Dashboard:
- Azure Documentation:
- [Internal Wiki: ]
9️⃣ Post-Incident Review
| Field | Description |
|---|---|
| Incident Date | YYYY-MM-DD |
| Root Cause Summary | |
| Time to Detect (MTTD) | |
| Time to Repair (MTTR) | |
| Preventive Action | |
| Owner | |
Responsible Team: Platform Engineering / Service Owner