Reliability

Why I used SQLite as a Kafka replay buffer, and why it was the right call

kafka reliability sqlite securonix

The Securonix analytics pipelines were dropping about 2,000 events per month. In a SIEM, dropped events aren't just an SLA metric. They're potential missed detections. An attack pattern that spans a sequence of events becomes invisible if any event in the sequence is dropped.

The drops were caused by transient Kafka broker issues: brief unavailability during leader elections, network hiccups, the usual distributed systems noise. The pipeline had no durability layer. If Kafka wasn't available at the moment of a write, the event was gone.

I built a replay buffer to fix this. The interesting decision was the storage backend.

Why retry logic alone doesn't work

The first instinct is to add retry logic with exponential backoff. Retry, wait, retry, wait. Eventually the broker recovers and the write succeeds. This handles transient failures that resolve within seconds.
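As a sketch, that first instinct looks something like this. The `publish` callable is a hypothetical stand-in for a synchronous Kafka send that raises while the broker is down; nothing here is from the actual pipeline code:

```python
import random
import time

def publish_with_backoff(publish, event, max_attempts=5, base_delay=0.5):
    """Retry a publish call with exponential backoff plus jitter.

    `publish` is any callable that raises on failure (a hypothetical
    stand-in for a synchronous Kafka producer send).
    Returns True on success, False once the retry window is exhausted.
    """
    for attempt in range(max_attempts):
        try:
            publish(event)
            return True
        except Exception:
            if attempt == max_attempts - 1:
                return False  # window exhausted; the event is simply lost
            # 0.5s, 1s, 2s, ... with jitter to avoid synchronized retries
            time.sleep(base_delay * (2 ** attempt) + random.uniform(0, 0.1))
    return False
```

Note where the event lives during all of this: in memory, inside a loop. That is the weakness the next paragraph describes.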

But it doesn't handle failures that last longer than your retry window. And critically, if the process restarts during a retry sequence (deployments, OOM kills, node replacements), you lose the in-flight events regardless of broker availability. Retry logic is in-memory; a process restart clears it.

A proper solution requires durability: events that fail to publish must be persisted somewhere that survives a process restart, and replayed when the broker recovers.
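Concretely, the write path becomes: try the broker first, fall back to a durable insert. This is a minimal sketch under stated assumptions, not the production code: the `replay_buffer` table, the `open_buffer` helper, and the producer's synchronous `send(key, value)` interface are all my inventions for illustration:

```python
import json
import sqlite3

def open_buffer(path: str) -> sqlite3.Connection:
    """Open (or create) the local SQLite replay buffer."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS replay_buffer ("
        " id INTEGER PRIMARY KEY AUTOINCREMENT,"
        " event_key TEXT NOT NULL,"
        " payload TEXT NOT NULL)"
    )
    return conn

def publish_or_buffer(producer, buf: sqlite3.Connection, key: str, event: dict) -> bool:
    """Try to publish; on failure, persist the event durably.

    `producer` is a hypothetical object whose `send(key, value)` raises
    when the broker is unavailable. Returns True if published directly,
    False if the event was buffered for later replay.
    """
    try:
        producer.send(key, json.dumps(event))
        return True
    except Exception:
        # Durable fallback: unlike an in-memory retry queue, this
        # INSERT survives a process restart.
        with buf:  # connection as context manager: commits the transaction
            buf.execute(
                "INSERT INTO replay_buffer (event_key, payload) VALUES (?, ?)",
                (key, json.dumps(event)),
            )
        return False
```

The key property is that the failure path ends in a committed transaction, not a sleep.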

Evaluating the options

Why SQLite was right for this access pattern

The access pattern for a replay buffer is simple: one writer (the pipeline failure handler), one reader (the retry loop), infrequent reads relative to writes, and durability is mandatory. SQLite is designed for exactly this: local, embedded, ACID, single-process. It's not a toy database. It's the right database for local durability requirements.

The single-threaded writer constraint that sometimes makes SQLite a poor choice for concurrent applications is actually an advantage here. I have one writer. I don't want two threads racing to write to the buffer. SQLite's serialized writes give me correct behavior without any synchronization code.

The replay logic is written to be safe under duplicates: each buffered event is republished with the same key as the original attempt, and a buffered row is deleted only after the broker acknowledges the write. One nuance worth stating precisely: Kafka's idempotent producer deduplicates retries within a single producer session, not across process restarts, so end to end the guarantee is at-least-once delivery, with deduplication on the event key keeping duplicates from affecting downstream consumers.
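The replay side can be sketched as a drain loop that deletes a row only after the broker acknowledges it. As before, the synchronous `producer.send` and the `replay_buffer` table (with autoincrementing ids giving replay order) are assumptions for illustration:

```python
import sqlite3

def drain_buffer(producer, buf: sqlite3.Connection, batch: int = 100) -> int:
    """Replay buffered events oldest-first.

    A row is deleted only after `producer.send` returns without
    raising, so a crash mid-drain re-replays (at-least-once) rather
    than losing events. Returns the number of events replayed.
    """
    replayed = 0
    rows = buf.execute(
        "SELECT id, event_key, payload FROM replay_buffer ORDER BY id LIMIT ?",
        (batch,),
    ).fetchall()
    for row_id, key, payload in rows:
        try:
            producer.send(key, payload)
        except Exception:
            break  # broker still down; stop and try again next cycle
        with buf:  # commit the delete only after the acknowledged send
            buf.execute("DELETE FROM replay_buffer WHERE id = ?", (row_id,))
        replayed += 1
    return replayed
```

Ordering deletes after acknowledgements is what makes the crash semantics at-least-once instead of at-most-once.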

The best solution wasn't the most sophisticated one. It was the one that matched the access pattern.

The result

Dropped events fell from ~2,000/month to ~400/month, a 5x improvement. Events that previously would have been lost to transient broker failures are now durably buffered and replayed. The change was invisible to the downstream consumers. They just stopped seeing gaps.

What I'd do differently

I'd add a buffer depth metric (how many events are currently queued in the SQLite buffer) and alert on it. Under normal conditions the buffer should be empty or near-empty. A growing buffer is a leading indicator that Kafka publish failures are accumulating faster than the retry loop can drain them, which means something more serious is happening. I added this monitoring later; it should have been part of the initial implementation.
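The metric itself is a one-line query; a sketch, assuming the buffered events live in a single `replay_buffer` table:

```python
import sqlite3

def buffer_depth(buf: sqlite3.Connection) -> int:
    """Number of events currently queued for replay.

    Should sit at or near zero when Kafka is healthy; alert on a
    value that grows across consecutive samples.
    """
    return buf.execute("SELECT COUNT(*) FROM replay_buffer").fetchone()[0]
```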
