The Securonix analytics pipelines were dropping about 2,000 events per month. In a SIEM, dropped events aren't just an SLA metric. They're potential missed detections. An attack pattern that spans a sequence of events becomes invisible if any event in the sequence is dropped.
The drops were caused by transient Kafka broker issues: brief unavailability during leader elections, network hiccups, the usual distributed systems noise. The pipeline had no durability layer. If Kafka wasn't available at the moment of a write, the event was gone.
I built a replay buffer to fix this. The interesting decision was the storage backend.
Why retry logic alone doesn't work
The first instinct is to add retry logic with exponential backoff. Retry, wait, retry, wait. Eventually the broker recovers and the write succeeds. This handles transient failures that resolve within seconds.
But it doesn't handle failures that last longer than your retry window. And critically, if the process restarts during a retry sequence (deployments, OOM kills, node replacements), you lose the in-flight events regardless of broker availability. Retry state lives in memory; a process restart clears it.
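The in-memory version is a dozen lines, which is part of its appeal. A minimal sketch (`publish` stands in for the real producer call):

```python
import time

def publish_with_backoff(event, publish, max_attempts=5, base_delay=0.5):
    """Retry a failed publish with exponential backoff.

    All state lives in this stack frame: if the process restarts
    mid-sequence, the pending event is simply gone.
    """
    for attempt in range(max_attempts):
        try:
            publish(event)  # stand-in for the real Kafka produce + flush
            return True
        except Exception:
            time.sleep(base_delay * (2 ** attempt))  # 0.5s, 1s, 2s, 4s...
    return False  # retry window exhausted; without a buffer, the event is dropped
```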
A proper solution requires durability: events that fail to publish must be persisted somewhere that survives a process restart, and replayed when the broker recovers.
Evaluating the options
- In-memory queue - fast, but loses data on process restart. Doesn't solve the problem.
- Redis - already in the stack, but adds a network hop. If Redis is also experiencing issues during the Kafka outage (not uncommon - network events affect multiple services), the buffer is unavailable.
- Postgres - full ACID guarantees, but heavyweight for a write-ahead buffer. Requires a network connection and another service to be healthy.
- SQLite - ACID guarantees, zero network dependency (a file on local disk), zero infrastructure overhead, and a single-writer model that eliminates concurrency bugs by design.
Why SQLite was right for this access pattern
The access pattern for a replay buffer is simple: one writer (the pipeline failure handler), one reader (the retry loop), infrequent reads relative to writes, and mandatory durability. SQLite is designed for exactly this: local, embedded, ACID, single-process. It's not a toy database. It's the right database for local durability requirements.
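Here's roughly what the write side looks like. The schema is a minimal illustration (key, payload, timestamp), not the production table:

```python
import sqlite3
import time

def open_buffer(path="replay_buffer.db"):
    conn = sqlite3.connect(path)
    conn.execute("PRAGMA journal_mode=WAL")   # retry loop can read while the handler writes
    conn.execute("PRAGMA synchronous=FULL")   # fsync on every commit; durability is the point
    conn.execute("""
        CREATE TABLE IF NOT EXISTS buffer (
            id        INTEGER PRIMARY KEY AUTOINCREMENT,
            event_key TEXT NOT NULL,
            payload   BLOB NOT NULL,
            queued_at REAL NOT NULL
        )
    """)
    return conn

def buffer_event(conn, key, payload):
    """Failure handler: persist the event before the publish attempt is abandoned."""
    with conn:  # implicit transaction, committed (and synced) on exit
        conn.execute(
            "INSERT INTO buffer (event_key, payload, queued_at) VALUES (?, ?, ?)",
            (key, payload, time.time()),
        )
```

WAL mode lets the retry loop read while the failure handler writes, and synchronous=FULL pays an fsync per commit, which is exactly the guarantee the buffer exists to provide.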
The single-writer constraint that sometimes makes SQLite a poor choice for concurrent applications is actually an advantage here. I have one writer. I don't want two threads racing to write to the buffer. SQLite's serialized writes give me correct behavior without any synchronization code.
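The reader side is a periodic drain loop. A sketch, assuming the schema above and a `publish` callable that raises on failure:

```python
def drain_buffer(conn, publish, batch_size=100):
    """Retry loop: replay oldest events first, deleting each row only
    after its publish is confirmed."""
    rows = conn.execute(
        "SELECT id, event_key, payload FROM buffer ORDER BY id LIMIT ?",
        (batch_size,),
    ).fetchall()
    for row_id, key, payload in rows:
        try:
            publish(key, payload)  # raises if the broker is still unavailable
        except Exception:
            return  # broker still down; leave the rest for the next pass
        with conn:
            conn.execute("DELETE FROM buffer WHERE id = ?", (row_id,))
```

Note the ordering: the row is deleted only after the publish is confirmed. A crash between the two steps replays the event rather than losing it.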
The replay path is therefore at-least-once, so duplicates have to be handled somewhere. Kafka's idempotent producer deduplicates the client's own internal retries at the broker (it tracks producer IDs and sequence numbers, not message keys), but a replay from the buffer after a restart looks like a new message to the broker. Each replayed event therefore carries the same key as the original publish attempt, giving downstream consumers a stable handle for deduplication.
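For the publish side itself, a sketch assuming the confluent_kafka client (the library choice, broker address, and topic name are illustrative):

```python
from confluent_kafka import Producer

producer = Producer({
    "bootstrap.servers": "broker:9092",  # illustrative
    "enable.idempotence": True,          # broker-side dedup of client retries
    "acks": "all",
})

def publish(key, payload, topic="events"):
    """Synchronous publish: return cleanly only once delivery is confirmed."""
    errors = []
    producer.produce(
        topic, key=key, value=payload,
        on_delivery=lambda err, msg: errors.append(err) if err else None,
    )
    remaining = producer.flush(10)  # wait up to 10s for the delivery report
    if remaining or errors:
        raise RuntimeError(f"publish not confirmed: {errors or 'flush timed out'}")
```

enable.idempotence covers the client's internal retries; the stable key covers buffer replays.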
The best solution wasn't the most sophisticated one. It was the one that matched the access pattern.
The result
Dropped events fell from ~2,000/month to ~400/month, a 5x reduction. Events that previously would have been lost are now durably buffered and replayed. The change was invisible to downstream consumers; they just stopped seeing gaps.
The one thing I'd change: a buffer depth metric (how many events are currently queued in the SQLite buffer) with an alert on it. Under normal conditions the buffer should be empty or near-empty. A growing buffer is a leading indicator that Kafka publish failures are accumulating faster than the retry loop can drain them, which means something more serious is happening. I added this monitoring later; it should have been part of the initial implementation.
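The check itself is one query against the buffer table, exported as a gauge:

```python
def buffer_depth(conn):
    """Gauge for monitoring: events currently awaiting replay.
    Should hover at or near zero in steady state."""
    (depth,) = conn.execute("SELECT COUNT(*) FROM buffer").fetchone()
    return depth
```

An alert on depth staying above zero across several drain intervals catches the accumulation case before it becomes an incident.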