
🧩 PostgreSQL Storage Utilization (Azure Postgres Flexible Server)

Labels: runbook, azure

Alert: storage_percent high utilization thresholds (85%, 90%, 95%)
Severity: P2 / P1 / P0 (by threshold)
Source: Azure Monitor
Jira Priority: Minor (85 %) / Critical (90 %) / Blocker (95 %)
Runbook Path: /runbooks/azure/postgres-storage.md
Related Components: Azure Database for PostgreSQL - Flexible Server (sdk-coredb-prod, sdk-usecasedb, disrupt, sdk-coredb-develop, sdk-usecasedb-dev, sdk-usecasedb-de)
Last Reviewed: 2025-11-18


1️⃣ Purpose

Detect and respond to increasing storage utilization on Azure Database for PostgreSQL - Flexible Server.
High storage usage can lead to service disruption, failed writes, or emergency scale-ups. This runbook explains how to triage, investigate, and remediate storage pressure.


2️⃣ Trigger Condition

Azure Monitor metric alert thresholds on the storage_percent metric of Microsoft.DBforPostgreSQL/flexibleServers (Maximum aggregation, 1-minute interval).

Metric: storage_percent
Namespace: Microsoft.DBforPostgreSQL/flexibleServers
Aggregation: Max (1m)
Thresholds:
  - name: pg-storage-85
    condition: storage_percent > 85
    duration: 15m
    severity: P2
  - name: pg-storage-90
    condition: storage_percent > 90
    duration: 10m
    severity: P1
  - name: pg-storage-95
    condition: storage_percent > 95
    duration: 5m
    severity: P0
Action Group: sdk-alerts

3️⃣ Immediate Actions

  1. Verify the alert on Azure Monitor: confirm the resource and current storage_percent trend.
  2. Check available storage and configured size for the server.
  3. Inspect largest databases and tables by size; confirm unexpected growth.
  4. Apply a mitigation if growth is ongoing:
    • Short-term: increase storage size for the Flexible Server, or scale storage tier if allowed.
    • Reduce write pressure (pause batch jobs, vacuum/analyze aggressively).
    • Free space by removing unneeded data (rotate logs, purge temp tables) per data retention policy.
  5. If storage_percent exceeds 90% and growth continues, escalate to the on-call DBA.

Useful commands/snippets:

```shell
# Azure CLI: current storage_percent for the affected server
az monitor metrics list \
  --resource /subscriptions/<subId>/resourceGroups/<rg>/providers/Microsoft.DBforPostgreSQL/flexibleServers/<server> \
  --metric-names storage_percent \
  --interval PT1M
```

```sql
-- Top tables by total size (table + indexes + TOAST)
SELECT schemaname, relname, pg_size_pretty(pg_total_relation_size(relid)) AS total
FROM pg_catalog.pg_statio_user_tables
ORDER BY pg_total_relation_size(relid) DESC
LIMIT 20;
```
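As a rough triage aid, the time remaining before the disk fills can be extrapolated from two storage_percent readings. A minimal sketch; the sample values and the two-hour gap are illustrative, not real readings:

```shell
# Estimate hours until 100% from two storage_percent samples.
# p1, p2, and hours_elapsed are placeholder values; substitute
# real readings from `az monitor metrics list`.
p1=85.2          # storage_percent at first sample
p2=86.0          # storage_percent at second sample
hours_elapsed=2  # time between the two samples

hours_to_full=$(awk -v p1="$p1" -v p2="$p2" -v h="$hours_elapsed" \
  'BEGIN { rate = (p2 - p1) / h; print (100 - p2) / rate }')
echo "estimated hours to full: $hours_to_full"
```

This is a linear extrapolation only; a burst of ingestion or a backfill job can fill the disk far faster than the recent trend suggests.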

4️⃣ Root Cause Investigation

| Possible Cause | Diagnostic Command/Check | Expected Output/Indicator |
|---|---|---|
| Unexpected data growth | SELECT top relations by size; review ingestion/backfill jobs | One/few tables growing rapidly |
| Autovacuum not reclaiming space | VACUUM/ANALYZE status; pg_stat_all_tables, dead tuples | High dead tuples, bloat indicators |
| Long-running transactions | pg_stat_activity | Old transactions preventing vacuum |
| Log or temp file accumulation | Check server logs, temp files, extensions producing artifacts | Large logs/temp usage |
| Retention misconfiguration | Data retention policy not applied | Historical data beyond policy present |
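The checks above can be run directly in psql. A sketch of the two most common ones, using the standard pg_stat views:

```sql
-- Dead tuples per table: high n_dead_tup with an old (or NULL)
-- last_autovacuum suggests autovacuum is not keeping up (bloat).
SELECT relname, n_live_tup, n_dead_tup, last_autovacuum
FROM pg_stat_user_tables
ORDER BY n_dead_tup DESC
LIMIT 20;

-- Long-running transactions that prevent vacuum from reclaiming space.
SELECT pid, state, xact_start, now() - xact_start AS xact_age, query
FROM pg_stat_activity
WHERE xact_start IS NOT NULL
ORDER BY xact_start
LIMIT 10;
```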

5️⃣ Remediation / Resolution Steps

Short-term containment:

  • Increase allocated storage capacity on the Flexible Server where possible.
  • Run VACUUM (FULL) on targeted tables during low traffic windows if bloat is confirmed.
  • Purge obsolete data per retention policy; archive to cheaper storage if needed.
  • Pause or throttle ingestion/backfill/batch jobs until storage returns below 80%.
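If bloat is confirmed, a targeted rewrite of the worst offenders can reclaim space. A hedged sketch (big_schema.big_table is a placeholder, not an actual table in these servers); note that VACUUM FULL takes an ACCESS EXCLUSIVE lock and needs enough free space to hold the table copy:

```sql
-- Measure before and after to confirm the space reclaimed.
SELECT pg_size_pretty(pg_total_relation_size('big_schema.big_table'));

-- Rewrites the table and its indexes; run only in a low-traffic window.
VACUUM (FULL, VERBOSE) big_schema.big_table;

SELECT pg_size_pretty(pg_total_relation_size('big_schema.big_table'));
```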

Long-term fixes:

  • Implement automatic partitioning and retention for large time-series tables.
  • Tune autovacuum settings for high-churn tables.
  • Right-size storage and IOPS for sustained growth.
  • Add monitoring for sudden deltas in table size to catch regressions early.
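The partitioning, retention, and autovacuum items above can be sketched in SQL; the events table and the monthly ranges are illustrative, not actual objects in these databases:

```sql
-- Declarative range partitioning for a time-series table: retention
-- becomes DROP TABLE on an expired partition (instant, no bloat)
-- instead of DELETE + VACUUM.
CREATE TABLE events (
    ts      timestamptz NOT NULL,
    payload jsonb
) PARTITION BY RANGE (ts);

CREATE TABLE events_2025_11 PARTITION OF events
    FOR VALUES FROM ('2025-11-01') TO ('2025-12-01');

-- Retention: drop a whole expired partition per policy.
DROP TABLE IF EXISTS events_2025_10;

-- Autovacuum tuning for a high-churn table: vacuum at ~2% dead
-- tuples instead of the 20% default.
ALTER TABLE events SET (autovacuum_vacuum_scale_factor = 0.02);
```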

6️⃣ Escalation Path

| Level | Role / Team | Contact |
|---|---|---|
| Primary | Platform Engineering (DBOps) | #sdk-ops |
| Secondary | Service Owner | <name or contact> |
| Escalate to | Product Owner / SRE Lead | <contact> |

7️⃣ Verification

  1. Confirm storage_percent returns to a stable baseline (< 80%).
  2. Ensure alerts auto-resolve in Azure Monitor.
  3. Validate that ingestion/batch jobs are resumed safely if they were paused.
  4. Update the Jira ticket with root cause and prevention notes.

8️⃣ References


9️⃣ Post-Incident Review

| Field | Description |
|---|---|
| Incident Date | YYYY-MM-DD |
| Root Cause Summary | |
| Time to Detect (MTTD) | |
| Time to Repair (MTTR) | |
| Preventive Action | |
| Owner | Platform Engineering |

Responsible Team: Platform Engineering / Service Owner