ClickHouse Schema Design

ALWAYS LOAD when creating or modifying ClickHouse tables.

Goals

Sub-second query response on billion-row tables
10x+ compression ratios vs raw data
Zero-maintenance data retention and tiered storage
Queries that scan <1% of data instead of full table scans

Reference Documentation

Search terms: schema design, ORDER BY, PRIMARY KEY, PARTITION BY, data types, TTL, index_granularity

Critical Rules

[CRITICAL]

ORDER BY determines query performance. Put most-filtered columns first, low cardinality before high.
PARTITION BY is for data management, not query speed. Use ORDER BY to speed up queries.
Use smallest data types possible. UInt32 not UInt64 if values fit.

[HIGH]

LowCardinality for strings with <10K unique values. Saves storage and speeds queries.
Avoid Nullable when possible. Use DEFAULT values instead.
PRIMARY KEY can be a prefix of ORDER BY. Reduces index size.

[MEDIUM]

3-5 columns in ORDER BY is typical. More columns rarely help, fewer may miss optimization opportunities.

Engine Selection

Engine	Use Case	Key Behavior
`MergeTree`	Append-only data (events, logs)	Standard storage, no special merge logic
`ReplacingMergeTree`	Deduplication, upserts	Keeps latest row per ORDER BY key; query with `argMax` pattern
`SummingMergeTree`	Pre-aggregated counters	Automatically sums numeric columns on merge
`AggregatingMergeTree`	Complex aggregates (uniq, quantile)	Stores intermediate states; use `-State`/`-Merge` functions
`CollapsingMergeTree`	Mutable rows	Uses +1/-1 sign column to cancel/update rows

For engine examples and Materialized Views, see clickhouse-materialized-views skill.

ORDER BY Selection

The 3-5 Column Rule

ORDER BY (
    tenant_id,      -- 1. Lowest cardinality, most filtered
    event_date,     -- 2. Time component (common filter)
    user_id,        -- 3. Medium cardinality
    event_type      -- 4. Higher cardinality (optional)
)

Decision Process

Which columns appear in WHERE clauses? - These go first
What's the cardinality? - Lower cardinality columns before higher
Is there a time dimension? - Usually included for range queries

Good vs Bad

-- BAD: High cardinality first, rarely-filtered columns
ORDER BY (user_id, timestamp, tenant_id)

-- GOOD: Low cardinality first, commonly-filtered columns
ORDER BY (tenant_id, toDate(timestamp), user_id)

PRIMARY KEY as Prefix

-- Full sorting key for data layout
ORDER BY (tenant_id, event_date, user_id, event_type)

-- Shorter primary key for smaller index (optional)
PRIMARY KEY (tenant_id, event_date)

PARTITION BY Guidelines

Size Targets

Scenario	Target Partition Size
General MergeTree tables	1-300 GB
SummingMergeTree / ReplacingMergeTree	400 MB - 40 GB
Small tables (<5 GB total)	No partitioning

Common Patterns

-- Monthly (most common for analytics)
PARTITION BY toYYYYMM(event_date)

-- Daily (high volume, >1TB/month)
PARTITION BY toDate(event_date)

-- Multi-tenant with time
PARTITION BY (tenant_id, toYYYYMM(event_date))

-- No partitioning (small tables)
-- Simply omit PARTITION BY clause

Anti-Pattern

-- BAD: Over-partitioning creates thousands of small parts
PARTITION BY (toDate(event_date), user_id)

-- BAD: Partitioning by high-cardinality column
PARTITION BY user_id

Data Type Optimization

Numbers

-- Use smallest type that fits your data
count UInt16,           -- Max 65,535 (instead of UInt64)
percentage Float32,     -- Instead of Float64 if 6-7 digits precision is enough
flags UInt8,            -- For small integers, booleans

Strings

-- LowCardinality for <10K unique values
country LowCardinality(String),
status LowCardinality(String),
event_type LowCardinality(String),

-- Regular String for high cardinality
user_agent String,
url String,

Dates and Times

-- Use simplest type that meets requirements
event_date Date,                    -- If you only need date
event_time DateTime,                -- If you need seconds
event_time_precise DateTime64(3),   -- Only if you need milliseconds

Avoiding Nullable

-- BAD: Nullable adds overhead and complexity
user_id Nullable(UInt64),

-- GOOD: Use DEFAULT for missing values
user_id UInt64 DEFAULT 0,

-- GOOD: Use empty string for missing text
name String DEFAULT '',

TTL Configuration

Delete Old Data

CREATE TABLE events (
    event_date Date,
    ...
) ENGINE = MergeTree()
ORDER BY (...)
TTL event_date + INTERVAL 90 DAY DELETE;

Tiered Storage

CREATE TABLE events (
    event_date Date,
    ...
) ENGINE = MergeTree()
ORDER BY (...)
TTL
    event_date + INTERVAL 7 DAY TO VOLUME 'hot',
    event_date + INTERVAL 30 DAY TO VOLUME 'warm',
    event_date + INTERVAL 365 DAY DELETE;

Column-Level TTL

CREATE TABLE events (
    event_date Date,
    user_id UInt64,
    -- Delete PII after 30 days, keep aggregated data
    email String TTL event_date + INTERVAL 30 DAY,
    ip_address String TTL event_date + INTERVAL 30 DAY
) ENGINE = MergeTree()
ORDER BY (event_date, user_id);

Complete Example

CREATE TABLE analytics_events (
    -- Time dimension
    event_date Date,
    event_time DateTime,

    -- Identifiers (low to high cardinality)
    tenant_id UInt32,
    user_id UInt64,
    session_id String,

    -- Categorical data (use LowCardinality)
    event_type LowCardinality(String),
    country LowCardinality(String),
    device_type LowCardinality(String),

    -- Metrics
    duration_ms UInt32,

    -- Flexible data
    properties String,  -- JSON as string

    -- Skip indices for secondary lookups
    INDEX idx_user user_id TYPE bloom_filter(0.01) GRANULARITY 3,
    INDEX idx_session session_id TYPE bloom_filter(0.01) GRANULARITY 3
)
ENGINE = MergeTree()
PARTITION BY toYYYYMM(event_date)
ORDER BY (tenant_id, event_date, user_id)
TTL event_date + INTERVAL 12 MONTH DELETE
SETTINGS index_granularity = 8192;

Anti-Patterns

Anti-Pattern	Problem	Solution
`ORDER BY (uuid, timestamp)`	High cardinality first	Put low cardinality columns first
`PARTITION BY toDate(ts)` for small tables	Too many small partitions	Omit partitioning or use monthly
`Nullable(UInt64)` everywhere	Storage and query overhead	Use DEFAULT values
`String` for status codes	Wastes space	Use `LowCardinality(String)` or `Enum`
`DateTime64(9)` always	Nanosecond precision rarely needed	Use `DateTime` or `DateTime64(3)`
Putting timestamp first in ORDER BY	Poor compression and filtering	Put categorical columns first

Verification Queries

-- Check table size and compression
SELECT
    table,
    formatReadableSize(sum(bytes_on_disk)) AS size,
    sum(rows) AS rows,
    round(sum(bytes_on_disk) / sum(rows), 2) AS bytes_per_row
FROM system.parts
WHERE active AND database = 'default'
GROUP BY table;

-- Check partition sizes
SELECT
    partition,
    count() AS parts,
    sum(rows) AS rows,
    formatReadableSize(sum(bytes_on_disk)) AS size
FROM system.parts
WHERE active AND table = 'your_table'
GROUP BY partition
ORDER BY partition;

-- Verify column types
SELECT name, type, compression_codec
FROM system.columns
WHERE database = 'default' AND table = 'your_table';

Troubleshooting

Always ask for user confirmation before applying schema changes (ALTER TABLE, recreating tables).

"Too Many Parts" Error

Problem: DB::Exception: Too many parts, inserts rejected, merge queue growing

Diagnose:

SELECT table, partition, count() AS parts
FROM system.parts
WHERE active AND database = 'default'
GROUP BY table, partition
HAVING parts > 300
ORDER BY parts DESC;

Solutions:

Cause	Fix
Over-partitioning (daily + high cardinality)	Use monthly partitions: `PARTITION BY toYYYYMM(date)`
Too many small inserts	Batch inserts: 1000+ rows per INSERT
High-cardinality partition key	Remove high-cardinality columns from PARTITION BY

-- Fix: Change from daily to monthly partitioning (requires table recreation)
CREATE TABLE events_new (...) PARTITION BY toYYYYMM(event_date) ...;
INSERT INTO events_new SELECT * FROM events;
RENAME TABLE events TO events_old, events_new TO events;
DROP TABLE events_old;

Poor Compression (<3x Ratio)

Problem: Table using more disk than expected, compression ratio below 3x

Diagnose:

SELECT
    column,
    type,
    formatReadableSize(sum(column_data_compressed_bytes)) AS compressed,
    round(sum(column_data_uncompressed_bytes) / sum(column_data_compressed_bytes), 2) AS ratio
FROM system.parts_columns
WHERE active AND table = 'your_table'
GROUP BY column, type
ORDER BY ratio ASC;

Solutions:

Cause	Fix
High-cardinality column first in ORDER BY	Reorder: low cardinality columns first
String for low-cardinality data	Use `LowCardinality(String)`
Wrong codec for data pattern	Use DoubleDelta for timestamps, Gorilla for floats
Random UUIDs in ORDER BY	Move UUID later in ORDER BY, or use different key

-- Check cardinality to decide LowCardinality usage
SELECT uniq(status) FROM events;  -- If <10K, use LowCardinality

-- Fix column type (requires recreation or new column)
ALTER TABLE events ADD COLUMN status_new LowCardinality(String);
ALTER TABLE events UPDATE status_new = status WHERE 1;
-- Then migrate queries to use status_new

Slow Queries Despite Good Schema

Problem: Queries slow even with proper ORDER BY and partitioning

Diagnose:

EXPLAIN indexes = 1 SELECT ... FROM your_table WHERE ...;
-- Check: Are granules being skipped? Is partition pruning happening?

Solutions:

Cause	Fix
Query doesn't filter on ORDER BY prefix	Add ORDER BY columns to WHERE clause
Function on filter column	Store computed column, filter on that
Missing skip index for secondary lookups	Add bloom_filter index
Selecting too many columns	Select only needed columns

-- Example: Table ORDER BY (tenant_id, event_date, user_id)

-- BAD: Skips ORDER BY prefix
SELECT * FROM events WHERE user_id = 123;

-- GOOD: Include prefix
SELECT * FROM events WHERE tenant_id = 1 AND event_date = today() AND user_id = 123;

-- ALT: Add skip index for direct user_id lookups
ALTER TABLE events ADD INDEX idx_user user_id TYPE bloom_filter(0.01) GRANULARITY 4;