When I joined SecurityScorecard, one of the first decisions I had to make was which streaming framework to build on. The team had a clean slate: no existing pipeline, no legacy to maintain, just a requirement to process billions of security observations daily with low latency.
The easy answer was Spark Structured Streaming. It's mature, well-documented, and I'd used it for years at Securonix. But I'd also watched it cause a $300K/month Datadog bill once our Spark apps scaled to production volume. I knew exactly why. I wasn't going to make the same mistake twice if I had the chance to start fresh.
So I chose Apache Flink. And then I had to prove the choice was right.
The problem with Spark Structured Streaming at scale
Spark Structured Streaming is micro-batch under the hood. It looks like streaming: you write streaming code, you get streaming semantics. But underneath it's running many small batch jobs in sequence, typically every 100–500ms. This is fine for a lot of workloads. But it creates two problems I'd felt directly at Securonix.
First, there's a latency floor you cannot engineer around. If your batch interval is 200ms, your minimum end-to-end latency is 200ms plus processing time. For a security platform where detection latency matters, that's a hard ceiling you bump into.
Second, Spark's default state store holds everything in executor JVM memory. For small state this is fine. For stateful enrichment at billions of events per day, when you're joining observation streams against large reference datasets, you're either capping state to what fits in memory or fighting GC pressure and spill-to-disk behavior that Spark doesn't handle gracefully.
Flink solves both. It processes each record as it arrives: true record-at-a-time, no batching. And its state backend is pluggable: I used RocksDB, which keeps state in an LSM tree on local disk, entirely off the JVM heap. You can hold terabytes of state in Flink without the garbage collector ever seeing it.
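For the shape of it, here's a minimal sketch of enabling the RocksDB backend in Flink's Java API. The backend class is Flink's own; the incremental-checkpoint flag and job wiring are illustrative, not a claim about how MOP-v2 is actually configured:

```java
import org.apache.flink.contrib.streaming.state.EmbeddedRocksDBStateBackend;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class RocksDbStateSetup {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Keyed state lives in RocksDB on local disk instead of the JVM heap.
        // `true` enables incremental checkpoints, so each checkpoint uploads
        // only the SST files that changed since the last one.
        env.setStateBackend(new EmbeddedRocksDBStateBackend(true));

        // ... define sources, operators, and sinks, then env.execute("job");
    }
}
```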
Building MOP-v2: the 4x throughput problem
The first major test of this decision was MOP-v2, the Measurements-Observations Pipeline version 2. The requirement: sustain 35M+ records per minute from Kafka into ClickHouse. We were starting from 9M. That's a 4x improvement, and the constraint was no new infrastructure.
I want to be honest about what this kind of work actually looks like. It's not one clever insight that unlocks the performance. It's about 50 deploy-measure-adjust cycles over several weeks. Here's the shape of what I went through:
Kafka consumer tuning
The first bottleneck was consumer throughput. I tuned fetch sizes, polling intervals, heartbeat timeouts, and max poll records, measuring the delta on each change before moving to the next. The goal was to maximize how much data each consumer could pull per cycle without overwhelming downstream processing.
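Here's the set of knobs that round touches, expressed as standard Kafka consumer properties. The values are illustrative starting points, not the numbers we shipped; the right values only fall out of the deploy-measure-adjust loop:

```java
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerConfig;

public class ConsumerTuning {
    static Properties consumerProps() {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "kafka:9092"); // placeholder
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "mop-v2");              // placeholder

        // Fetch sizing: pull more data per broker round trip.
        props.put(ConsumerConfig.FETCH_MIN_BYTES_CONFIG, "1048576");           // wait for ~1 MiB...
        props.put(ConsumerConfig.FETCH_MAX_WAIT_MS_CONFIG, "500");             // ...or 500 ms, whichever first
        props.put(ConsumerConfig.MAX_PARTITION_FETCH_BYTES_CONFIG, "8388608"); // 8 MiB per partition

        // Poll sizing: more records handed downstream per poll().
        props.put(ConsumerConfig.MAX_POLL_RECORDS_CONFIG, "5000");
        props.put(ConsumerConfig.MAX_POLL_INTERVAL_MS_CONFIG, "300000");

        // Heartbeats: give the group coordinator slack as polls get heavier.
        props.put(ConsumerConfig.HEARTBEAT_INTERVAL_MS_CONFIG, "3000");
        props.put(ConsumerConfig.SESSION_TIMEOUT_MS_CONFIG, "30000");
        return props;
    }
}
```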
ClickHouse insert optimization
The second bottleneck was write throughput. ClickHouse's MergeTree storage engine merges data parts in the background. Write small parts faster than it can merge them and you hit the "Too many parts" error and inserts start failing. The fix was batching: accumulate records and write them as one part. Fewer, larger parts mean the merge process can keep up.
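A stripped-down sketch of the batching write over plain JDBC (the ClickHouse JDBC driver accepts standard batch inserts). The table, columns, and row type here are hypothetical stand-ins for the real schema:

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;
import java.util.List;

public class ClickHouseBatcher {
    // Hypothetical row type standing in for the real schema.
    record Observation(long ts, String source, String payload) {}

    // One large INSERT produces one new MergeTree part, instead of a tiny
    // part per record that the background merge process can't keep up with.
    static void flush(Connection conn, List<Observation> batch) throws SQLException {
        String sql = "INSERT INTO observations (ts, source, payload) VALUES (?, ?, ?)";
        try (PreparedStatement ps = conn.prepareStatement(sql)) {
            for (Observation o : batch) {
                ps.setLong(1, o.ts());
                ps.setString(2, o.source());
                ps.setString(3, o.payload());
                ps.addBatch();
            }
            ps.executeBatch(); // single round trip, single new part
        }
    }
}
```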
I also moved to async writes with connection pooling. The Flink operator thread shouldn't block waiting for a ClickHouse acknowledgment. It should hand off the batch to an async flush thread and keep processing. This required careful synchronization at the flush boundary, which I'll come back to.
Parallelism tuning
More Flink task slots means more parallel consumers, which means more throughput, up to a point. I found the ceiling where adding more parallelism stopped helping: the ClickHouse write path became the bottleneck instead of the consumer path. At that point, adding more consumers just created more write contention. The right answer was to tune both in concert.
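In Flink terms, tuning both in concert means treating source and sink parallelism as separate dials rather than scaling one number. A sketch with illustrative values and a placeholder sink:

```java
import org.apache.flink.api.common.eventtime.WatermarkStrategy;
import org.apache.flink.api.connector.sink2.Sink;
import org.apache.flink.connector.kafka.source.KafkaSource;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class ParallelismWiring {
    // `clickHouseSink` is a placeholder for whatever Sink carries the batched writes.
    static void wire(StreamExecutionEnvironment env,
                     KafkaSource<String> source,
                     Sink<String> clickHouseSink) {
        env.fromSource(source, WatermarkStrategy.noWatermarks(), "kafka-source")
           .setParallelism(32)  // consumer side: scales Kafka reads
           .sinkTo(clickHouseSink)
           .setParallelism(8);  // write side: capped so ClickHouse merges keep up
    }
}
```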
The thread safety bug
The async flush design introduced a subtle race condition. The Flink operator thread was filling a batch buffer while the flush thread was draining it. Without explicit synchronization at the swap boundary, you'd occasionally get partial writes or buffer corruption under high load. I caught this in load testing, not in production, and only because I was looking for it. The fix was a synchronized batch swap: fill one buffer, swap it atomically for an empty one, flush the old one. Simple in retrospect, invisible if you don't think through the concurrency model.
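Here's the shape of the fix as a sketch, not the production code. The names are hypothetical; the point is that the swap happens under the lock while the slow I/O happens outside it:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Double-buffered writer: the operator thread fills `active` while the
// flush thread drains a swapped-out buffer. Neither thread ever sees
// a list the other is mutating.
class BatchWriter<T> {
    private final Object swapLock = new Object();
    private final ExecutorService flusher = Executors.newSingleThreadExecutor();
    private final int maxBatch;
    private List<T> active = new ArrayList<>();

    BatchWriter(int maxBatch) { this.maxBatch = maxBatch; }

    // Called only from the Flink operator thread.
    void add(T record) {
        List<T> full = null;
        synchronized (swapLock) {
            active.add(record);
            if (active.size() >= maxBatch) {
                full = active;                      // take the full buffer...
                active = new ArrayList<>(maxBatch); // ...and swap in an empty one
            }
        }
        if (full != null) {
            final List<T> batch = full;
            flusher.submit(() -> write(batch)); // I/O happens off the operator thread
        }
    }

    private void write(List<T> batch) {
        // hand the batch to ClickHouse (see the JDBC sketch above)
    }
}
```

The lock is held only for the list swap, never for the network write, so the stall on the operator thread stays bounded no matter how slow ClickHouse is at that moment.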
S3 checkpoints and observability
The last pieces were operational. I moved checkpoint storage to S3 so that a pipeline restart doesn't reprocess hours of data. The checkpoint records exactly where we were in each Kafka partition and what state the operators held. And I added consumer lag metrics via SLF4J so future tuning iterations had a proper feedback loop instead of guessing.
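The checkpoint setup in Flink's Java API looks roughly like this; the interval, bucket path, and retention policy below are illustrative:

```java
import org.apache.flink.streaming.api.environment.CheckpointConfig;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class CheckpointSetup {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Snapshot Kafka offsets and operator state every 60 seconds.
        env.enableCheckpointing(60_000);

        // Durable storage: a restart resumes from the last snapshot instead
        // of reprocessing hours of Kafka data. Bucket name is a placeholder.
        env.getCheckpointConfig().setCheckpointStorage("s3://some-bucket/checkpoints/mop-v2");

        // Keep the last checkpoint around even if the job is cancelled.
        env.getCheckpointConfig().setExternalizedCheckpointCleanup(
            CheckpointConfig.ExternalizedCheckpointCleanup.RETAIN_ON_CANCELLATION);
    }
}
```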
The result: 9M → 35M+ records/min. 4x improvement, zero new infrastructure, ~75% reduction in per-message processing cost.
The Confluent podcast
This work, along with the schema-first Kafka architecture that made it possible, ended up being featured on Confluent's podcast, Life Is But A Stream, Episode 20. I talked through how SSC's approach cut observation processing time from a weekend to near real-time. It was a good conversation, and it was a useful forcing function: explaining your architecture to an external audience will quickly surface any hand-waving in your own understanding.
What I'd do differently
I'd instrument earlier. The first few weeks of tuning were partially blind because I didn't have good consumer lag visibility from the start. Adding observability as an afterthought means your first N tuning cycles are less informed than they should be. Now I treat observability as part of the initial implementation, not a follow-on.
I'd also document the parallelism ceiling more explicitly in the codebase. Future engineers who increase Flink parallelism will hit the ClickHouse write bottleneck and wonder why it didn't help. A comment explaining the relationship between Flink parallelism and ClickHouse write concurrency would save someone a week.