ClickHouse Integration
This document describes a simple and stable way to integrate ClickHouse into the platform as the main analytics database for Superset.
The concept follows one rule:
Keep schema, rollout, and tenant integration explicit. Use plain SQL, fixed contracts, and as few moving parts as possible.
Goal
- ClickHouse provides the analytical database for tenant-specific reporting and exploration.
- Superset connects directly to ClickHouse.
- Each tenant works in its own isolated ClickHouse database.
- Schema is managed in Git as executable SQL.
- Data loading is separate from schema rollout.
Core idea
The platform does not need a modeling layer for ClickHouse.
The stable path is:
- deploy ClickHouse
- create one database per tenant
- manage schema as SQL migrations in Git
- create tenant credentials
- expose those credentials to Superset through a fixed Secret contract
- let ingestion workflows load data into the existing tables
This keeps the system understandable for operators and developers:
- the SQL in Git is the schema
- the migration runner applies that SQL
- Superset only needs working tenant credentials
Architecture
Git
├─ ClickHouse deployment manifests
├─ global SQL migrations
├─ tenant bootstrap SQL
├─ tenant schema SQL
└─ tenant credential manifests
Runtime
├─ one ClickHouse deployment
├─ one ClickHouse database per tenant
├─ one ClickHouse user per tenant
├─ one Superset tenant per organization
└─ one Superset datasource per tenant pointing to its ClickHouse database
Data flow
├─ data lands in curated storage
├─ loader jobs insert data into ClickHouse tables
└─ Superset reads from ClickHouse
Principles
- SQL-first: schema is written directly as SQL.
- Git-first: all managed schema changes are reviewed in Git.
- Tenant isolation first: every tenant gets its own database and credentials.
- Explicit rollout: schema changes are applied deliberately, not by hidden background automation.
- Small contracts: Superset integration depends only on a small credential Secret.
- Separate responsibilities: deployment, schema rollout, Superset lifecycle, and ingestion are independent workflows.
ClickHouse deployment
ClickHouse runs as a Kubernetes workload and is treated as a shared analytics backend.
Current expectations:
- one Kubernetes deployment context for ClickHouse
- persistent storage
- stable in-cluster service name
- HTTP access for Superset
- native access for operational tooling
The deployment can stay small for now. It can be expanded later without changing the tenant contract.
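As a hedged sketch, HTTP access for Superset could look like the following SQLAlchemy URI, assuming the clickhouse-connect driver and a hypothetical in-cluster service name; tenant slug, namespace, and port are placeholders:

```
clickhousedb://user_tenant_acme:<password>@clickhouse.analytics.svc:8123/tenant_acme
```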
Tenant model
Each tenant gets:
- one database: tenant_<slug>
- one role: role_tenant_<slug>
- one user: user_tenant_<slug>
The tenant user is used by Superset for that tenant.
This keeps the security model simple:
- Superset tenant A only sees database A
- Superset tenant B only sees database B
- there is no shared base database in the normal path
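As a sketch, the bootstrap SQL for a hypothetical tenant slug `acme` could look like this; the grant set and password handling are illustrative assumptions, not a fixed contract:

```sql
-- Create the isolated tenant database.
CREATE DATABASE IF NOT EXISTS tenant_acme;

-- Role that carries read access to the tenant database.
CREATE ROLE IF NOT EXISTS role_tenant_acme;
GRANT SELECT ON tenant_acme.* TO role_tenant_acme;

-- User that Superset connects with for this tenant.
CREATE USER IF NOT EXISTS user_tenant_acme IDENTIFIED BY '<generated-password>';
GRANT role_tenant_acme TO user_tenant_acme;
```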
Schema management
Schema is managed directly as versioned SQL migrations.
Migration structure:
migrations/
  <environment>/
    global/
    tenants/
      <tenant>/
        bootstrap/
        schema/
    tenants.txt
Meaning:
- global/ contains shared operational SQL
- bootstrap/ contains tenant database and access setup
- schema/ contains tenant table definitions and later changes
- tenants.txt defines which tenants are active for rollout
Rules:
- the SQL file is the contract
- the database name is written explicitly in tenant SQL
- schema changes are additive and versioned
- rollout happens through a migration command, not through ad-hoc manual edits in ClickHouse
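An additive, versioned change is then a single reviewed SQL file that names the database explicitly. The file name, table, and column below are illustrative:

```sql
-- migrations/<environment>/tenants/acme/schema/0002_add_state_of_charge.sql
-- Additive change: a new nullable column is safe for existing queries.
ALTER TABLE tenant_acme.batterydata
    ADD COLUMN IF NOT EXISTS state_of_charge Nullable(Float64);
```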
Table design
This concept does not enforce a special naming model beyond clarity and stability.
Tables should be:
- domain-specific
- easy to understand
- stable enough for dashboards and queries
Good examples are simple table names such as:
- batterydata
- reservationdata_device
- reservationdata_reservations
The concept does not require a separate space dimension in the schema contract.
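A minimal table definition in this style could look as follows; the columns, engine choice, and ordering key are assumptions for illustration:

```sql
-- Illustrative domain table; MergeTree with a time-series ordering key.
CREATE TABLE IF NOT EXISTS tenant_acme.batterydata
(
    device_id    String,
    measured_at  DateTime,
    voltage      Float64,
    temperature  Float64
)
ENGINE = MergeTree
ORDER BY (device_id, measured_at);
```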
Data loading
Data loading is outside the schema mechanism.
That means:
- schema migrations create and change tables
- loader jobs insert data into those tables
- loaders can be implemented independently of the migration runner
Typical sources:
- Parquet files in curated storage
- batch exports
- domain-specific processing pipelines
This separation is useful because schema rollout and data ingestion have different failure modes and different release cycles.
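As one hedged example, a loader job could insert Parquet data from curated storage using ClickHouse's s3 table function; the bucket URL and column list are placeholders:

```sql
-- Illustrative load from curated Parquet storage into an existing table.
INSERT INTO tenant_acme.batterydata
SELECT device_id, measured_at, voltage, temperature
FROM s3(
    'https://storage.example.com/curated/batterydata/*.parquet',
    'Parquet'
);
```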
Superset integration
Superset is integrated per tenant.
Each SupersetTenant can enable a ClickHouse datasource.
The runtime contract is:
- ClickHouse host and port are platform-level settings
- the tenant-specific database name, username, and password come from a Kubernetes Secret
- the operator injects these values into the Superset pod
- Superset registers or updates the ClickHouse datasource automatically
Required Secret keys:
- database
- username
- password
This is enough to make the Superset side simple and repeatable.
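A sketch of the credential Secret, with the three required keys; the Secret name, namespace, and values are assumptions:

```yaml
# Illustrative tenant credential Secret for the Superset integration.
apiVersion: v1
kind: Secret
metadata:
  name: superset-clickhouse-acme
  namespace: superset
type: Opaque
stringData:
  database: tenant_acme
  username: user_tenant_acme
  password: <generated-password>
```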
Operational workflow
The standard path for a new tenant is:
- create tenant bootstrap SQL
- add tenant schema SQL
- activate the tenant in the environment
- run migration dry-run
- apply migrations
- generate tenant credentials
- publish the tenant credential Secret for Superset
- create or update the SupersetTenant
- verify datasource readiness
The standard path for a schema change is:
- add a new SQL migration file
- review it in Git
- run dry-run
- apply it
- run state checks
- verify dashboards and queries
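The state checks can stay simple. For example, a query against ClickHouse system tables confirms that the expected tables exist in a tenant database (tenant name is a placeholder):

```sql
-- List tables in a tenant database to verify the migration landed.
SELECT name, engine
FROM system.tables
WHERE database = 'tenant_acme'
ORDER BY name;
```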
What stays stable over time
Even if the ClickHouse deployment grows later, these contracts should remain stable:
- tenant database naming
- tenant user naming
- migration directory layout
- explicit SQL migrations
- Secret keys for Superset integration
That allows the platform to improve deployment topology later without redesigning the integration model.
Summary
ClickHouse integration is simple when the contract stays small:
- ClickHouse is the analytics database
- Superset connects directly to it
- every tenant gets its own database and credentials
- schema is managed as SQL in Git
- migrations are applied explicitly
- ingestion stays separate from schema rollout
This is not the most abstract solution. It is the one that is easiest to understand, operate, and extend safely.