
🧩 PostgreSQL Storage Utilization (Azure Postgres Flexible Server)

Labels: runbook, azure

Alert: storage_percent high utilization thresholds (85%, 90%, 95%)
Severity: P2 / P1 / P0 (by threshold)
Source: Azure Monitor
Jira Priority: Minor (85 %) / Critical (90 %) / Blocker (95 %)
Runbook Path: /runbooks/azure/postgres-storage.md
Related Components: Azure Database for PostgreSQL - Flexible Server (sdk-coredb-prod, sdk-usecasedb, disrupt, sdk-coredb-develop, sdk-usecasedb-dev, sdk-usecasedb-de)
Last Reviewed: 2025-11-18


1️⃣ Purpose

Detect and respond to increasing storage utilization on Azure Database for PostgreSQL - Flexible Server.
High storage usage can lead to service disruption, failed writes, or emergency scale-ups. This runbook explains how to triage, investigate, and remediate storage pressure.


2️⃣ Trigger Condition

Azure Monitor metric alert thresholds on the storage_percent metric of Microsoft.DBforPostgreSQL/flexibleServers (Maximum aggregation, 1-minute interval).

Metric: storage_percent
Namespace: Microsoft.DBforPostgreSQL/flexibleServers
Aggregation: Max (1m)
Thresholds:
  - name: pg-storage-85
    condition: storage_percent > 85
    duration: 15m
    severity: P2
  - name: pg-storage-90
    condition: storage_percent > 90
    duration: 10m
    severity: P1
  - name: pg-storage-95
    condition: storage_percent > 95
    duration: 5m
    severity: P0
Action Group: sdk-alerts

3️⃣ Immediate Actions

  1. Verify the alert on Azure Monitor: confirm the resource and current storage_percent trend.
  2. Check available storage and configured size for the server.
  3. Inspect largest databases and tables by size; confirm unexpected growth.
  4. Apply a mitigation if growth is ongoing:
    • Short-term: increase storage size for the Flexible Server, or scale storage tier if allowed.
    • Reduce write pressure (pause batch jobs, vacuum/analyze aggressively).
    • Free space by removing unneeded data (rotate logs, purge temp tables) per data retention policy.
  5. If storage_percent exceeds 90% and growth continues, escalate to the on-call DBA.

Useful commands/snippets:

```shell
# Azure CLI: current storage_percent for the affected server
az monitor metrics list \
  --resource /subscriptions/<subId>/resourceGroups/<rg>/providers/Microsoft.DBforPostgreSQL/flexibleServers/<server> \
  --metric-names storage_percent \
  --interval PT1M
```

```sql
-- Top tables by total size (table + indexes + TOAST)
SELECT schemaname, relname, pg_size_pretty(pg_total_relation_size(relid)) AS total
FROM pg_catalog.pg_statio_user_tables
ORDER BY pg_total_relation_size(relid) DESC
LIMIT 20;
```
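As a rough triage aid, the time remaining before the disk fills can be extrapolated from two storage_percent readings. A minimal sketch; the sample values and the two-hour gap are illustrative, not real readings:

```shell
# Estimate hours until 100% from two storage_percent samples.
# p1, p2, and hours_elapsed are placeholder values; substitute
# real readings from `az monitor metrics list`.
p1=85.2          # storage_percent at first sample
p2=86.0          # storage_percent at second sample
hours_elapsed=2  # time between the two samples

hours_to_full=$(awk -v p1="$p1" -v p2="$p2" -v h="$hours_elapsed" \
  'BEGIN { rate = (p2 - p1) / h; print (100 - p2) / rate }')
echo "estimated hours to full: $hours_to_full"
```

This is a linear extrapolation only; a burst of ingestion or a backfill job can fill the disk far faster than the recent trend suggests.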

4️⃣ Root Cause Investigation

| Possible Cause | Diagnostic Command/Check | Expected Output/Indicator |
|---|---|---|
| Unexpected data growth | SELECT top relations by size; review ingestion/backfill jobs | One/few tables growing rapidly |
| Autovacuum not reclaiming space | VACUUM/ANALYZE status; pg_stat_all_tables, dead tuples | High dead tuples, bloat indicators |
| Long-running transactions | pg_stat_activity | Old transactions preventing vacuum |
| Log or temp file accumulation | Check server logs, temp files, extensions producing artifacts | Large logs/temp usage |
| Retention misconfiguration | Data retention policy not applied | Historical data beyond policy present |
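The checks above can be run directly in psql. A sketch of the two most common ones, using the standard pg_stat views:

```sql
-- Dead tuples per table: high n_dead_tup with an old (or NULL)
-- last_autovacuum suggests autovacuum is not keeping up (bloat).
SELECT relname, n_live_tup, n_dead_tup, last_autovacuum
FROM pg_stat_user_tables
ORDER BY n_dead_tup DESC
LIMIT 20;

-- Long-running transactions that prevent vacuum from reclaiming space.
SELECT pid, state, xact_start, now() - xact_start AS xact_age, query
FROM pg_stat_activity
WHERE xact_start IS NOT NULL
ORDER BY xact_start
LIMIT 10;
```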

5️⃣ Remediation / Resolution Steps

Short-term containment:

  • Increase allocated storage capacity on the Flexible Server where possible.
  • Run VACUUM (FULL) on targeted tables during low traffic windows if bloat is confirmed.
  • Purge obsolete data per retention policy; archive to cheaper storage if needed.
  • Pause or throttle ingestion/backfill/batch jobs until storage returns below 80%.
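If bloat is confirmed, a targeted rewrite of the worst offenders can reclaim space. A hedged sketch (big_schema.big_table is a placeholder, not an actual table in these servers); note that VACUUM FULL takes an ACCESS EXCLUSIVE lock and needs enough free space to hold the table copy:

```sql
-- Measure before and after to confirm the space reclaimed.
SELECT pg_size_pretty(pg_total_relation_size('big_schema.big_table'));

-- Rewrites the table and its indexes; run only in a low-traffic window.
VACUUM (FULL, VERBOSE) big_schema.big_table;

SELECT pg_size_pretty(pg_total_relation_size('big_schema.big_table'));
```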

Long-term fixes:

  • Implement automatic partitioning and retention for large time-series tables.
  • Tune autovacuum settings for high-churn tables.
  • Right-size storage and IOPS for sustained growth.
  • Add monitoring for sudden deltas in table size to catch regressions early.
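The partitioning, retention, and autovacuum items above can be sketched in SQL; the events table and the monthly ranges are illustrative, not actual objects in these databases:

```sql
-- Declarative range partitioning for a time-series table: retention
-- becomes DROP TABLE on an expired partition (instant, no bloat)
-- instead of DELETE + VACUUM.
CREATE TABLE events (
    ts      timestamptz NOT NULL,
    payload jsonb
) PARTITION BY RANGE (ts);

CREATE TABLE events_2025_11 PARTITION OF events
    FOR VALUES FROM ('2025-11-01') TO ('2025-12-01');

-- Retention: drop a whole expired partition per policy.
DROP TABLE IF EXISTS events_2025_10;

-- Autovacuum tuning for a high-churn table: vacuum at ~2% dead
-- tuples instead of the 20% default.
ALTER TABLE events SET (autovacuum_vacuum_scale_factor = 0.02);
```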

6️⃣ Escalation Path

| Level | Role / Team | Contact |
|---|---|---|
| Primary | Platform Engineering (DBOps) | #sdk-ops |
| Secondary | Service Owner | <name or contact> |
| Escalate to | Product Owner / SRE Lead | <contact> |

7️⃣ Verification

  1. Confirm storage_percent returns to a stable baseline (< 80%).
  2. Ensure alerts auto-resolve in Azure Monitor.
  3. Validate that ingestion/batch jobs are resumed safely if they were paused.
  4. Update the Jira ticket with root cause and prevention notes.

8️⃣ References


9️⃣ Post-Incident Review

| Field | Description |
|---|---|
| Incident Date | YYYY-MM-DD |
| Root Cause Summary | |
| Time to Detect (MTTD) | |
| Time to Repair (MTTR) | |
| Preventive Action | |
| Owner | Platform Engineering |

Responsible Team: Platform Engineering / Service Owner