ClickHouse Alerting

This document defines the minimum alerting needed for the ClickHouse integration.

The goal is simple:

Alert on the failures that break analytics, schema rollout, or tenant access.

Alerting scope

The current integration needs signals in three areas:

  • ClickHouse availability
  • resource and storage pressure
  • tenant integration readiness for Superset

Required alerts

Alert: ClickHouse unreachable
  Signal: service on 8123 or 9000 not reachable, /ping fails
  Why it matters: Superset queries and migration tooling stop
  First action: check pod, service, and recent logs

Alert: ClickHouse pod restart loop
  Signal: repeated restarts or pod not ready
  Why it matters: the current deployment has no failover path
  First action: check OOM, storage, and startup logs

Alert: Disk usage high
  Signal: disk usage above the warning or critical threshold
  Why it matters: writes and merges need free space
  First action: free up space, expand storage, or reduce incoming load

Alert: Memory pressure high
  Signal: memory near its limit or OOMKilled
  Why it matters: queries fail and the pod may restart
  First action: inspect query load and memory sizing

Alert: HTTP failures rising
  Signal: rising error rate on 8123 or a failing readiness probe
  Why it matters: Superset users see broken datasets and SQL Lab errors
  First action: correlate with logs and resource pressure

Alert: Migration drift detected
  Signal: migrate check returns DIFF
  Why it matters: managed Git state and runtime state diverged
  First action: apply missing SQL or restore missing migration files

Alert: Tenant ClickHouse Secret missing
  Signal: SupersetTenant reports the ClickHouse Secret as not ready
  Why it matters: the tenant Superset cannot configure its datasource
  First action: create or fix the tenant Secret

Alert: Tenant stuck provisioning
  Signal: SupersetTenant stays non-healthy for too long
  Why it matters: tenant analytics is not usable
  First action: inspect tenant conditions and Superset pod logs
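The unreachable alert can be probed directly. A minimal sketch of a /ping check, assuming ClickHouse's HTTP interface on port 8123; host and port are placeholders for your actual service address:

```python
import urllib.request
import urllib.error

def clickhouse_reachable(host: str = "localhost", port: int = 8123,
                         timeout: float = 5.0) -> bool:
    """Return True if the ClickHouse HTTP interface answers /ping with 'Ok.'."""
    url = f"http://{host}:{port}/ping"
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200 and resp.read().strip() == b"Ok."
    except (urllib.error.URLError, OSError):
        # Connection refused, DNS failure, or timeout: treat as unreachable.
        return False
```

The native protocol on 9000 needs a client library; for a liveness signal, the HTTP /ping endpoint is enough.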

Suggested thresholds

The current deployment only needs simple thresholds:

  • disk warning: > 85%
  • disk critical: > 95%
  • memory warning: > 85%
  • memory critical: > 95%
  • alert immediately on sustained /ping failure
  • alert on repeated pod restarts in a short period

Exact metric expressions depend on the monitoring stack.
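Whatever the stack, the thresholds above reduce to simple classification logic. A sketch; the 85/95 cut-offs mirror the list above, while the restart window and count are illustrative assumptions, not values from this document:

```python
def usage_level(pct: float, warning: float = 85.0, critical: float = 95.0) -> str:
    """Classify a disk or memory usage percentage against the alert thresholds."""
    if pct > critical:
        return "critical"
    if pct > warning:
        return "warning"
    return "ok"

def restart_loop(restart_times: list[float], now: float,
                 window_s: float = 600.0, max_restarts: int = 3) -> bool:
    """True if more than max_restarts restarts fall inside the last window_s seconds.

    The 10-minute window and 3-restart limit are illustrative defaults;
    tune them to the deployment.
    """
    recent = [t for t in restart_times if now - t <= window_s]
    return len(recent) > max_restarts
```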

Tenant readiness signals

Tenant-level alerting is part of the ClickHouse integration because Superset depends on correct credential wiring.

Useful signals:

  • ClickHouseSecretReady=False
  • tenant status not HEALTHY
  • Superset pod logs show datasource configuration errors

This catches a common operational case: ClickHouse itself is up, but the tenant cannot use it because credentials or wiring are wrong.
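These signals can be read from the SupersetTenant status. A sketch that scans a status dict of the kind returned by kubectl with JSON output; the exact status layout here is an assumption, and only the ClickHouseSecretReady condition name and the HEALTHY state come from this document:

```python
def tenant_alerts(status: dict) -> list[str]:
    """Collect tenant-level alert reasons from a SupersetTenant status dict.

    Assumed shape: {"state": "...", "conditions": [{"type": ..., "status": ...}]}.
    """
    alerts = []
    for cond in status.get("conditions", []):
        # ClickHouseSecretReady=False means the tenant Secret is missing or broken.
        if cond.get("type") == "ClickHouseSecretReady" and cond.get("status") == "False":
            alerts.append("ClickHouse Secret not ready")
    # Any non-HEALTHY state means tenant analytics is not usable.
    if status.get("state") != "HEALTHY":
        alerts.append(f"tenant state is {status.get('state')}")
    return alerts
```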

Operational checks after rollout

After schema rollout, the minimum operational validation is:

  1. migrate all --dry-run
  2. apply the migration
  3. migrate check
  4. verify the SupersetTenant

If these checks pass, the integration path is usually healthy.
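Step 3 can be automated by interpreting the migrate check output. A minimal sketch, assuming only what the alert table states: the command reports DIFF when managed Git state and runtime state diverge:

```python
def migration_in_sync(check_output: str) -> bool:
    """Interpret `migrate check` output: DIFF means Git and runtime state diverged."""
    return "DIFF" not in check_output
```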

Dashboard essentials

A minimal dashboard is enough if it shows:

  • pod readiness
  • restart count
  • CPU and memory
  • disk usage
  • HTTP availability
  • recent migration check results
  • readiness of ClickHouse-enabled SupersetTenant resources

Future extension

If the ClickHouse deployment grows later, alerting can be extended with:

  • backup job failures
  • replication lag
  • detached parts growth
  • replica or shard drift
  • inter-node network failures

Those alerts belong to a larger deployment topology. They are not required for the basic integration concept.

Summary

The alerting model stays small:

  • watch whether ClickHouse is reachable
  • watch storage and memory
  • watch migration drift
  • watch tenant Secret readiness and tenant health

That covers the real failure modes of the ClickHouse integration.