Designing a reporting layer on 18 billion rows without exploding your infrastructure

When the PV2 Vendor Risk Management product needed a reporting layer (dashboards showing observation trends, vendor risk distribution, threat levels), the raw material was 18.1 billion rows in ClickHouse. The question wasn't whether we could query it. The question was how to do it without a dedicated cluster that would cost more than the feature was worth at this stage.

The answer required actually looking at the data before making any architecture decisions.

Profile first, design second

The instinct when asked to design a reporting layer is to immediately start thinking about materialized views, pre-aggregations, or a separate OLAP store. I resisted that instinct and spent time profiling the production data first. This is the step that most engineers skip, and it's the step that saved us ~170 GB of storage and an entire infrastructure deployment.

The row fan-out analysis

The initial design proposed pre-joining observations with vendor metadata and storing the result, essentially denormalizing for query speed. I calculated what that would actually produce: a 6.83x row multiplication. 18.1B rows × 6.83 = ~123B rows, requiring ~170 GB of additional storage. And the query performance benefit was marginal. A well-indexed query-time JOIN in ClickHouse would be nearly as fast, because ClickHouse's columnar storage makes joins much cheaper than in row-store databases.

I chose the query-time JOIN and documented the calculation. 170 GB saved. No denormalization.
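
For illustration, the shape of that query-time JOIN is roughly the following. This is a sketch with hypothetical table and column names, not the production schema:

```sql
-- Hypothetical names throughout; the real schema differs.
-- Observations are scanned columnar-style and joined to the small
-- vendor metadata table at query time, instead of storing a
-- pre-joined 123B-row copy.
SELECT
    v.vendor_name,
    o.threat_level,
    count() AS observation_count
FROM observations AS o
INNER JOIN vendors AS v ON o.vendor_id = v.vendor_id
WHERE o.observed_at >= now() - INTERVAL 90 DAY
GROUP BY v.vendor_name, o.threat_level
ORDER BY observation_count DESC
```

ClickHouse builds the hash table for a join from the right-hand side, so keeping the small vendors table on the right is part of what keeps this cheap.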

The 29x data reduction

The second finding from profiling: not all 18.1B rows are relevant for Iteration 1 reporting widgets. The widgets we were building needed data filtered to specific threat levels. By applying this filter at ingestion time, materializing only rows above a certain threat level into the reporting namespace, I reduced the dataset from 18.1B rows to 619M rows. A 29x reduction that covered every Iteration 1 requirement.
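
A minimal sketch of what that ingestion-time filter can look like, assuming it is applied by a materialized view; the database, table names, and threshold below are hypothetical stand-ins:

```sql
-- Sketch only: names and threshold are hypothetical, and the target
-- table reporting.observations_filtered is assumed to already exist.
-- Rows below the threat threshold never enter the reporting namespace.
CREATE MATERIALIZED VIEW reporting.observations_filtered_mv
TO reporting.observations_filtered
AS
SELECT
    observation_id,
    vendor_id,
    threat_level,
    observed_at
FROM prod.observations
WHERE threat_level >= 3  -- hypothetical cutoff that keeps ~619M of 18.1B rows
```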

The architecture decisions

| Decision | Alternative | Choice | Why |
|---|---|---|---|
| Storage | Query raw observations | Dedicated reporting namespace (39 GB) | Isolate reporting load; zero risk to production writes |
| Join strategy | Pre-join + store (123B rows) | Query-time JOIN | 170 GB saved; ClickHouse columnar makes it fast enough |
| Ingestion scope | All 18.1B rows | Threat level filter → 619M rows | 29x reduction covers all Iteration 1 widgets |
| Pipeline | Flink streaming pipeline | ClickHouse Materialized Views (Iteration 1) | Zero new infrastructure; Flink path documented for scale |
| Query caching | ClickHouse only | Cube.dev semantic layer | <500ms perceived latency on cache hit |

ClickHouse Materialized Views for Iteration 1

ClickHouse has a nice property: you can define a materialized view that's triggered by inserts into a source table. Whenever new observations land in the base table, the MV automatically aggregates them into the reporting table. This gives you near-real-time reporting data without a separate Flink pipeline.
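
As a minimal sketch of the pattern (the schema, names, and rollup shape are assumptions, not the real definitions), the target table plus the insert-triggered MV might look like this:

```sql
-- Hypothetical rollup: daily observation counts per vendor and threat level.
CREATE TABLE reporting.vendor_daily_counts
(
    day          Date,
    vendor_id    UInt64,
    threat_level UInt8,
    observations UInt64
)
ENGINE = SummingMergeTree
ORDER BY (day, vendor_id, threat_level);

-- Fires on every insert into the source table and writes
-- partial per-block aggregates into the rollup above.
CREATE MATERIALIZED VIEW reporting.vendor_daily_counts_mv
TO reporting.vendor_daily_counts
AS
SELECT
    toDate(observed_at) AS day,
    vendor_id,
    threat_level,
    count() AS observations
FROM reporting.observations_filtered
GROUP BY day, vendor_id, threat_level;
```

One subtlety: SummingMergeTree collapses duplicate keys only during background merges, so dashboard queries should still sum(observations) with a GROUP BY rather than assume one row per key.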

This was the right choice for Iteration 1 because it required zero new infrastructure and the latency was acceptable: the MV transformation executes as part of the insert into the base table, so reporting data lags by seconds at most. For Iteration 2, if query patterns get more complex or the data volume grows beyond what ClickHouse MVs can handle efficiently, the migration path to a proper Flink enrichment pipeline is already documented.

The temptation in situations like this is to build the full solution from the start. But Iteration 1 widgets had specific, bounded query patterns. Building Flink + Cube.dev + a full semantic layer for requirements that a ClickHouse MV handles correctly is engineering for imagined future scale, not actual current requirements.

Cross-team coordination

This project had an aspect that was new to me: the architecture document had to define work for teams other than my own. The reporting layer spans four epics: Data Taxonomy, Flink Stream Enrichment (mine), ClickHouse Schema (partner team), and Cube.dev Semantic Layer (partner team). I had to define the technical contracts between layers precisely enough that the teams could build in parallel without waiting on each other.

The forcing function for getting this right is asking: if a team implements their layer exactly as I've specified, will it connect correctly to what I'm building? And if it won't, where is the ambiguity in my spec that allows the misconnection?
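
In practice the tightest form of that contract is a concrete artifact rather than prose. A hypothetical fragment of what a table contract between the Flink and ClickHouse epics could look like (not the actual spec):

```sql
-- Hypothetical contract fragment. The writer (Flink enrichment) and the
-- owner (ClickHouse schema team) agree on names, types, and semantics
-- before either side builds.
CREATE TABLE reporting.enriched_observations
(
    observation_id UInt64,          -- unique per source event; writer guarantees no reuse
    vendor_id      UInt64,          -- must resolve against the vendor metadata table
    threat_level   UInt8,           -- only rows above the agreed threshold are written
    observed_at    DateTime('UTC')  -- event time, not ingestion time
)
ENGINE = MergeTree
ORDER BY (vendor_id, observed_at);
```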

The most valuable part of this project wasn't the architecture. It was profiling the data before committing to the architecture. Numbers change decisions. Intuitions about 18 billion rows are often wrong.
