🧩 PostgreSQL Storage Utilization (Azure Postgres Flexible Server)
Alert: storage_percent high utilization thresholds (85%, 90%, 95%)
Severity: P2 / P1 / P0 (by threshold)
Source: Azure Monitor
Jira Priority: Minor (85%) / Critical (90%) / Blocker (95%)
Runbook Path: /runbooks/azure/postgres-storage.md
Related Components: Azure Database for PostgreSQL - Flexible Server (sdk-coredb-prod, sdk-usecasedb, disrupt, sdk-coredb-develop, sdk-usecasedb-dev, sdk-usecasedb-de)
Last Reviewed: 2025-11-18
1️⃣ Purpose
Detect and respond to increasing storage utilization on Azure Database for PostgreSQL - Flexible Server.
High storage usage can lead to service disruption, failed writes, or emergency scale-ups. This runbook explains how to triage, investigate, and remediate storage pressure.
2️⃣ Trigger Condition
Azure Monitor metric thresholds on Microsoft.DBforPostgreSQL/flexibleServers → storage_percent (Maximum, 1 min interval).
Metric: storage_percent
Namespace: Microsoft.DBforPostgreSQL/flexibleServers
Aggregation: Max (1m)
Thresholds:
- name: pg-storage-85
condition: storage_percent > 85
duration: 15m
severity: P2
- name: pg-storage-90
condition: storage_percent > 90
duration: 10m
severity: P1
- name: pg-storage-95
condition: storage_percent > 95
duration: 5m
severity: P0
Action Group: sdk-alerts

3️⃣ Immediate Actions
- Verify the alert in Azure Monitor: confirm the affected resource and the current storage_percent trend.
- Check available storage and the configured storage size for the server.
- Inspect largest databases and tables by size; confirm unexpected growth.
- Apply a mitigation if growth is ongoing:
- Short-term: increase storage size for the Flexible Server, or scale storage tier if allowed.
- Reduce write pressure (pause batch jobs, vacuum/analyze aggressively).
- Free space by removing unneeded data (rotate logs, purge temp tables) per data retention policy.
- If within 10% of full capacity and growth continues, escalate to on-call DBA.
Useful commands/snippets:

# Azure CLI: current metric
az monitor metrics list \
  --resource /subscriptions/<subId>/resourceGroups/<rg>/providers/Microsoft.DBforPostgreSQL/flexibleServers/<server> \
  --metric-names storage_percent \
  --interval PT1M

-- Top tables by size (PostgreSQL)
SELECT schemaname, relname, pg_size_pretty(pg_total_relation_size(relid)) AS total
FROM pg_catalog.pg_statio_user_tables
ORDER BY pg_total_relation_size(relid) DESC
LIMIT 20;

4️⃣ Root Cause Investigation
| Possible Cause | Diagnostic Command/Check | Expected Output/Indicator |
|---|---|---|
| Unexpected data growth | SELECT top relations by size; review ingestion/backfill jobs | One/few tables growing rapidly |
| Autovacuum not reclaiming space | VACUUM/ANALYZE status; pg_stat_all_tables, dead tuples | High dead tuples, bloat indicators |
| Long-running transactions | pg_stat_activity | Old transactions preventing vacuum |
| Log or temp file accumulation | Check server logs, temp files, extensions producing artifacts | Large logs/temp usage |
| Retention misconfiguration | Data retention policy not applied | Historical data beyond policy present |
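The checks in the table above can be run with a few catalog queries. This is a minimal sketch, assuming psql access to the affected server; limits and column widths are illustrative.

```sql
-- Dead tuples and last autovacuum per table (bloat indicator):
SELECT relname,
       n_dead_tup,
       n_live_tup,
       last_autovacuum
FROM pg_stat_user_tables
ORDER BY n_dead_tup DESC
LIMIT 20;

-- Long-running transactions that can prevent vacuum from reclaiming space:
SELECT pid,
       usename,
       state,
       now() - xact_start AS xact_age,
       left(query, 80) AS query
FROM pg_stat_activity
WHERE xact_start IS NOT NULL
ORDER BY xact_start
LIMIT 20;

-- Per-database size, to spot which database is growing:
SELECT datname, pg_size_pretty(pg_database_size(datname)) AS size
FROM pg_database
ORDER BY pg_database_size(datname) DESC;
```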
5️⃣ Remediation / Resolution Steps
Short-term containment:
- Increase allocated storage capacity on the Flexible Server where possible.
- Run VACUUM (FULL) on targeted tables during low-traffic windows if bloat is confirmed; note that VACUUM FULL takes an exclusive lock and rewrites the table.
- Purge obsolete data per retention policy; archive to cheaper storage if needed.
- Pause or throttle ingestion/backfill/batch jobs until storage returns below 80%.
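The storage increase from the first containment step can be applied from the CLI. A sketch, with placeholder resource names and a hypothetical target size of 512 GiB; note that Flexible Server storage can only be grown, never shrunk.

```shell
# Sketch: grow allocated storage (GiB) on the Flexible Server.
# <rg>, <server>, and the 512 GiB target are placeholders for this
# environment's values; storage size can only be increased.
az postgres flexible-server update \
  --resource-group <rg> \
  --name <server> \
  --storage-size 512
```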
Long-term fixes:
- Implement automatic partitioning and retention for large time-series tables.
- Tune autovacuum settings for high-churn tables.
- Right-size storage and IOPS for sustained growth.
- Add monitoring for sudden deltas in table size to catch regressions early.
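Per-table autovacuum tuning from the list above can be expressed with storage parameters. A sketch, assuming a hypothetical high-churn table named "events"; actual thresholds must be sized to the table's churn rate.

```sql
-- Sketch: more aggressive autovacuum for a high-churn table.
-- "events" is a hypothetical table name; values are illustrative.
ALTER TABLE events SET (
  autovacuum_vacuum_scale_factor = 0.02,  -- vacuum after ~2% dead tuples
  autovacuum_vacuum_cost_delay = 2        -- lower delay = vacuum works faster
);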
6️⃣ Escalation Path
| Level | Role / Team | Contact |
|---|---|---|
| Primary | Platform Engineering (DBOps) | #sdk-ops |
| Secondary | Service Owner | <name or contact> |
| Escalate to | Product Owner / SRE Lead | <contact> |
7️⃣ Verification
- Confirm storage_percent returns to a stable baseline (< 80%).
- Ensure alerts auto-resolve in Azure Monitor.
- Validate that ingestion/batch jobs are resumed safely if they were paused.
- Update the Jira ticket with root cause and prevention notes.
8️⃣ References
- Runbook: PostgreSQL Storage Utilization
- Azure Docs: Supported Metrics for Flexible Server
- CLI Reference: az monitor metrics alert
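For reference, the 85% threshold from section 2 could be recreated with the CLI above. A sketch only: resource names and IDs are placeholders, and the --severity value (Azure's 0-4 scale) is an assumed mapping for P2.

```shell
# Sketch: recreate the pg-storage-85 alert rule (placeholder names/IDs).
az monitor metrics alert create \
  --name pg-storage-85 \
  --resource-group <rg> \
  --scopes /subscriptions/<subId>/resourceGroups/<rg>/providers/Microsoft.DBforPostgreSQL/flexibleServers/<server> \
  --condition "max storage_percent > 85" \
  --window-size 15m \
  --evaluation-frequency 1m \
  --severity 2 \
  --action sdk-alerts
```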
9️⃣ Post-Incident Review
| Field | Description |
|---|---|
| Incident Date | YYYY-MM-DD |
| Root Cause Summary | |
| Time to Detect (MTTD) | |
| Time to Repair (MTTR) | |
| Preventive Action | |
| Owner | Platform Engineering |
Responsible Team: Platform Engineering / Service Owner