ClickHouse: Aggregating High Cardinality Datasets

Published on Sunday, November 30, 2025

Separate Search and Data Queries

When writing a query it's common to filter the set and select the data you want from that set in the same statement. When aggregating over high-cardinality datasets this becomes harder to do because of the physical constraints of the machine.

A simple solution to this problem is to search your dataset first, return some small intermediate representation, and then use that intermediate representation to query the data you need. In practice this looks something like:

-- Preflight query: find the aggregation keys of interest
SELECT agg_col
FROM table
GROUP BY agg_col

-- Data query: fetch only the rows belonging to those keys
SELECT agg_col, count() AS hits
FROM table
WHERE agg_col IN preflight_results
GROUP BY agg_col

This could be expressed in other ways (e.g. as a subquery), but separating the two can sometimes have nice properties, and it's the easiest form to understand for the purposes of this post.
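For completeness, the same idea expressed as a single statement with a subquery might look like the sketch below (the table and column names are placeholders):

-- Single-statement form: the subquery plays the preflight role
SELECT agg_col, count() AS hits
FROM table
WHERE agg_col IN (
    SELECT agg_col
    FROM table
    GROUP BY agg_col
)
GROUP BY agg_col

Keeping the two queries separate, though, lets you cache or inspect the intermediate key set between the steps.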

Stream Filters

Sometimes we can't remove data from our search query. If we want to know whether the aggregation state of a column has some property, then we're going to need to aggregate that column in some way.

There's the naive approach, where we ask if google.com exists in the set of grouped URLs:

has(groupArray(url), 'google.com')

And there's the streaming approach, where we ask if some smaller symbolic value exists in the set:

sum(url = 'google.com') > 0

Both filters match the same aggregation states. However, the first consumes memory proportional to the total size of all URLs in each group, whereas the second consumes memory proportional to the number of aggregation states multiplied by a small constant factor.
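In context, a streaming filter like this typically lives in a HAVING clause. A sketch, assuming a hypothetical events table with user_id and url columns:

-- Keep only groups that saw at least one google.com hit.
-- url = 'google.com' evaluates to 0 or 1, so sum() counts matches
-- without materializing the full URL list per group.
SELECT user_id
FROM events
GROUP BY user_id
HAVING sum(url = 'google.com') > 0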

Materialize Columns

Oftentimes we want to know how many times something happened. This is useful in many contexts, but particularly for sorting. Suppose we had some column error_ids of type Array(String). If we wanted to sort the aggregation states by the number of errors per aggregation, we would need to write something like this:

sum(length(error_ids)) DESC

This appears fine at first and echoes what we did in the previous section, but there's a new problem: transient memory. We need to read the error_ids column in order to determine its length, which is a time-consuming (and memory-intensive) process. In an ideal world ClickHouse could clear this transient memory fast enough that it never becomes a problem but, for whatever reason (probably very good reasons), this doesn't appear to be the case.

We can fix this by materializing the column in advance, so that we only ever aggregate a small constant-size value. We materialize the column in our table schema like this:

`count_errors` UInt8 MATERIALIZED length(error_ids)

Then we can target the materialized column directly in our filter or sort expression:

sum(count_errors) DESC
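Put together, a sort over the materialized column might look like the following sketch (the table and the request_id column are placeholders; only count_errors comes from the schema above):

-- count_errors is precomputed at insert time, so the sort never
-- touches the error_ids arrays themselves.
SELECT request_id, sum(count_errors) AS total_errors
FROM requests
GROUP BY request_id
ORDER BY total_errors DESC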

Set Limits

ClickHouse exposes two query settings called max_rows_to_group_by and group_by_overflow_mode. You can use them to cap the amount of memory a query consumes.

group_by_overflow_mode is the lesser of the two; you can explore the documentation on your own time. I find it doesn't contribute materially to performance, though it may have behavioral implications for your use case. Set it to either break or any.

max_rows_to_group_by is much more impactful. You set it to an integer value such as 1,000,000. It caps the number of aggregation states a single worker will retain and, by extension, caps the amount of memory you will consume. Ideally this number is as high as possible while still delivering an acceptable experience within a reasonable budget. Some exploration is needed here to determine the value you should adopt.
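Both settings can be attached to an individual query with a SETTINGS clause. A sketch (the limit value is illustrative, not a recommendation):

-- Retain at most 1,000,000 aggregation states per worker;
-- 'any' keeps accumulating into existing states once the cap is hit.
SELECT agg_col, count()
FROM table
GROUP BY agg_col
SETTINGS max_rows_to_group_by = 1000000, group_by_overflow_mode = 'any'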

Measure Improvement

As always when optimizing anything, measure the performance of your changes, and be methodical. Below is a query (and its output) for measuring the duration, memory usage, and number of rows read for recent queries:

SELECT
    query_duration_ms AS time_ms,
    read_rows,
    read_bytes,
    memory_usage,
    query
FROM system.query_log
WHERE type = 'QueryFinish'
ORDER BY event_time DESC
LIMIT 10

┌─time_ms─┬─read_rows─┬─read_bytes─┬─memory_usage─┬─query────────────────┐
│      24 │    245760 │    7047494 │     14303428 │ WHERE old = 'Safari' │
│      14 │    286720 │    4874860 │      8743218 │ WHERE new = 'Safari' │
└─────────┴───────────┴────────────┴──────────────┴──────────────────────┘

Furthermore, ClickHouse ships a benchmarking tool, clickhouse-benchmark. If you're using the Docker image you can execute it by calling the following command: docker exec clickhouse /usr/bin/clickhouse-benchmark. The tool runs your query in a loop, measures its average and percentile times, and reports various other useful metrics.
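A typical invocation pipes the query in on stdin and caps the number of iterations; the container name, query, and iteration count below are illustrative:

echo "SELECT agg_col, count() FROM table GROUP BY agg_col" | docker exec -i clickhouse /usr/bin/clickhouse-benchmark --iterations 100

Note the -i flag on docker exec, which keeps stdin open so the query reaches the benchmark tool.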

Don't Aggregate

Bonus tip: don't aggregate. You will know things about your dataset that the SQL planner does not. If you receive a query that can be filtered or sorted without applying the GROUP BY clause, skip the aggregation. If this is a customer-facing feature then you'll need a sophisticated query optimizer, but these cases can be easy to handle on a case-by-case basis.
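As an illustration, when every aggregation key is known to map to exactly one row, the two queries below return the same rows, and the second builds no aggregation states at all (table and column names are placeholders):

-- Aggregating
SELECT agg_col, any(url)
FROM table
GROUP BY agg_col

-- Not aggregating: valid only when agg_col is unique per row,
-- a fact you may know about your data that the planner does not.
SELECT agg_col, url
FROM table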