ClickHouse Alerting
This document defines the minimum alerting needed for the ClickHouse integration.
The goal is simple:
Alert on the failures that break analytics, schema rollout, or tenant access.
Alerting scope
The current integration needs signals in three areas:
- ClickHouse availability
- resource and storage pressure
- tenant integration readiness for Superset
Required alerts
| Alert | Signal | Why it matters | First action |
|---|---|---|---|
| ClickHouse unreachable | Service on 8123 (HTTP) or 9000 (native TCP) not reachable, or /ping fails | Superset queries and migration tooling stop | Check pod, service, and recent logs |
| ClickHouse pod restart loop | Repeated restarts or pod not ready | Current deployment has no failover path | Check OOM, storage, and startup logs |
| Disk usage high | Disk usage above warning or critical threshold | Writes and merges need free space | Free space, expand storage, or reduce incoming load |
| Memory pressure high | Memory near limit or OOMKilled | Queries fail and the pod may restart | Inspect query load and memory sizing |
| HTTP failures rising | Error rate on 8123 or failing readiness probe | Superset users see broken datasets and SQL Lab errors | Correlate with logs and resource pressure |
| Migration drift detected | migrate check returns DIFF | Managed Git state and runtime state diverged | Apply missing SQL or restore missing migration files |
| Tenant ClickHouse Secret missing | SupersetTenant reports ClickHouse Secret not ready | Tenant Superset cannot configure its datasource | Create or fix the tenant Secret |
| Tenant stuck provisioning | SupersetTenant stays non-healthy for too long | Tenant analytics is not usable | Inspect tenant conditions and Superset pod logs |
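The reachability signal in the first row can be probed over the HTTP interface, where ClickHouse answers /ping with the body Ok. when healthy. The following is a minimal sketch, not the integration's actual probe: the function name, the injectable opener parameter, and the timeout value are assumptions made for illustration.

```python
import urllib.request


def clickhouse_ping(base_url: str, timeout: float = 2.0,
                    opener=urllib.request.urlopen) -> bool:
    """Return True if GET <base_url>/ping answers 200 with body 'Ok.'.

    `opener` is injectable (hypothetical parameter) so the check can be
    exercised without a live server.
    """
    url = f"{base_url.rstrip('/')}/ping"
    try:
        with opener(url, timeout=timeout) as resp:
            body = resp.read().decode("utf-8", errors="replace").strip()
            return resp.status == 200 and body == "Ok."
    except OSError:
        # URLError, HTTPError, timeouts, and refused connections are all OSError.
        return False
```

A caller would pass the service URL, for example clickhouse_ping("http://clickhouse:8123"), where the host name is a placeholder. Whether a single failure or several consecutive failures should page someone is left to the monitoring stack.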
Suggested thresholds
The current deployment only needs simple thresholds:
- disk warning: > 85%
- disk critical: > 95%
- memory warning: > 85%
- memory critical: > 95%
- alert immediately on sustained /ping failure
- alert on repeated pod restarts in a short period
Exact metric expressions depend on the monitoring stack.
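Independent of the monitoring stack, the threshold logic above can be sketched in a few lines. The 85% and 95% values come from the list. The restart window (10 minutes) and restart count (3) are assumptions, since the text only says "repeated" and "short period", and all function names are illustrative.

```python
from typing import Optional, Sequence

WARNING_PCT = 85.0   # from the suggested thresholds
CRITICAL_PCT = 95.0  # from the suggested thresholds


def severity(used_pct: float) -> Optional[str]:
    """Classify a disk or memory usage percentage; thresholds are strict (>)."""
    if used_pct > CRITICAL_PCT:
        return "critical"
    if used_pct > WARNING_PCT:
        return "warning"
    return None


def restart_loop(restart_times: Sequence[float], now: float,
                 window_s: float = 600.0, max_restarts: int = 3) -> bool:
    """True if at least `max_restarts` restarts fall inside the last `window_s` seconds.

    window_s and max_restarts are assumed defaults, not values from this document.
    """
    recent = [t for t in restart_times if 0 <= now - t <= window_s]
    return len(recent) >= max_restarts
```

The same function serves both disk and memory because the suggested thresholds are identical. If they diverge later, the constants would become parameters.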
Tenant readiness signals
Tenant-level alerting is part of the ClickHouse integration because Superset depends on correct credential wiring.
Useful signals:
- ClickHouseSecretReady=False
- tenant status not HEALTHY
- Superset pod logs show datasource configuration errors
This catches a common operational case:
- ClickHouse itself is up
- but the tenant cannot use it because credentials or wiring are wrong
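As a rough illustration of how the first two signals could be evaluated, the sketch below inspects a SupersetTenant object that has already been fetched as a dictionary. The field layout is an assumption: it presumes Kubernetes-style status.conditions entries with string statuses and a status.health field, neither of which is confirmed by this document. Only the ClickHouseSecretReady condition name and the HEALTHY value come from the text.

```python
from typing import Any, Dict, List


def tenant_alerts(tenant: Dict[str, Any]) -> List[str]:
    """Return alert names for one SupersetTenant resource.

    Assumed (hypothetical) layout:
      status.conditions: [{"type": ..., "status": "True" | "False" | "Unknown"}]
      status.health:     "HEALTHY" when the tenant is usable
    """
    status = tenant.get("status", {})
    alerts: List[str] = []
    for cond in status.get("conditions", []):
        if cond.get("type") == "ClickHouseSecretReady" and cond.get("status") == "False":
            alerts.append("Tenant ClickHouse Secret missing")
    if status.get("health") != "HEALTHY":
        alerts.append("Tenant not healthy")
    return alerts
```

A tenant with a missing Secret would typically trigger both alerts, which matches the case described above: ClickHouse is up, but the tenant cannot use it. Deciding how long a tenant may stay non-healthy before it counts as stuck is left to the alerting rule.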
Operational checks after rollout
After schema rollout, the minimum operational validation is:
- migrate all --dry-run
- apply the migration
- migrate check
- verify the SupersetTenant
If these checks pass, the integration path is usually healthy.
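The first three steps can be chained in a small script. This is a sketch under stated assumptions: it presumes migrate check prints DIFF as a separate word on stdout when drift exists, and it takes the apply command from the caller because this document does not name it. The helper names run_cli, has_drift, and validate_rollout are illustrative.

```python
import subprocess
from typing import Callable, List


def run_cli(args: List[str]) -> str:
    """Run a command and return its stdout; raises CalledProcessError on non-zero exit."""
    return subprocess.run(args, check=True, capture_output=True, text=True).stdout


def has_drift(check_output: str) -> bool:
    """Assumption: drift is reported as a standalone DIFF token in the output."""
    return "DIFF" in check_output.split()


def validate_rollout(apply_cmd: List[str],
                     run: Callable[[List[str]], str] = run_cli) -> bool:
    """Dry-run, apply, then check. Returns True when no drift is reported."""
    run(["migrate", "all", "--dry-run"])
    run(apply_cmd)  # caller-supplied; the apply command is not specified here
    return not has_drift(run(["migrate", "check"]))
```

Verifying the SupersetTenant remains a separate step, because it depends on cluster state rather than on the migration tool.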
Dashboard essentials
A minimal dashboard is enough if it shows:
- pod readiness
- restart count
- CPU and memory
- disk usage
- HTTP availability
- recent migration check results
- readiness of ClickHouse-enabled SupersetTenant resources
Future extension
If the ClickHouse deployment grows later, alerting can be extended with:
- backup job failures
- replication lag
- detached parts growth
- replica or shard drift
- inter-node network failures
Those alerts belong to a larger deployment topology. They are not required for the basic integration concept.
Summary
The alerting model stays small:
- watch whether ClickHouse is reachable
- watch storage and memory
- watch migration drift
- watch tenant Secret readiness and tenant health
That covers the failure modes that matter for the current ClickHouse integration: broken analytics, broken schema rollout, and broken tenant access.