
How a Redis network bottleneck cost $2.3M/year - and how I found it


This is a story about finding a problem nobody had formally identified, understanding it well enough to fix it correctly, and then having to convince an AWS service team to help fix it. The technical part was interesting. The organizational part was harder.

The background

At Securonix, Redis was used heavily across the SIEM analytics pipeline: third-party data lookups, whitelist tables, enrichment caches. The Indicator Extraction Engine (IEE) ran at high parallelism: many executor threads simultaneously hitting Redis for lookups on every incoming security event.

Individually, each Redis read was fast. The problem was invisible until you looked at the aggregate.

Finding the bottleneck

I noticed anomalous Redis latency spikes in the metrics. Not high baseline latency, but intermittent spikes that correlated with IEE executor load. The spikes weren't CPU-correlated or memory-correlated. They were network-correlated.
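For concreteness, the check that rules CPU and memory out is not complicated: pull the node's CloudWatch metrics for the same window and compare their peaks. A minimal sketch using boto3; the cluster id and region are placeholders, not the real ones.

    # Sketch: compare CPU, memory, and network peaks for one cache node over
    # the window the latency spikes appeared in. Cluster id and region are
    # placeholders.
    from datetime import datetime, timedelta, timezone

    import boto3

    cloudwatch = boto3.client("cloudwatch", region_name="us-east-1")
    end = datetime.now(timezone.utc)
    start = end - timedelta(hours=6)

    def peak(metric_name):
        resp = cloudwatch.get_metric_statistics(
            Namespace="AWS/ElastiCache",
            MetricName=metric_name,
            Dimensions=[{"Name": "CacheClusterId", "Value": "iee-lookup-cache-001"}],
            StartTime=start,
            EndTime=end,
            Period=60,
            Statistics=["Maximum"],
        )
        return max((p["Maximum"] for p in resp["Datapoints"]), default=0.0)

    for name in ("EngineCPUUtilization", "DatabaseMemoryUsagePercentage", "NetworkBytesOut"):
        print(f"{name}: peak={peak(name):,.0f}")

When the first two stay flat through the spikes and the third pins itself against a ceiling, you know which dimension to chase.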

The diagnosis: the IEE executor was generating a high volume of concurrent Redis read requests. Each read was small and fast in isolation, but the aggregate network I/O was saturating the Redis node's NIC. We were running on Redis node types that were not network-optimized: instances with relatively modest network throughput ceilings. We were paying for compute we didn't need and starving on network capacity we did.
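The arithmetic behind "the aggregate saturates the NIC" fits on a napkin. A sketch with illustrative numbers, not our actual event rates or node specs:

    # Back-of-envelope: aggregate Redis read traffic vs. one node's network
    # ceiling. Every number here is an illustrative placeholder.
    events_per_sec = 50_000        # incoming security events
    lookups_per_event = 8          # Redis reads per event (enrichment, whitelists, ...)
    bytes_per_lookup = 2_048       # average response size, incl. protocol overhead

    aggregate_bps = events_per_sec * lookups_per_event * bytes_per_lookup * 8
    nic_ceiling_bps = 5 * 10**9    # e.g. a node rated "up to 5 Gbps"

    print(f"aggregate read traffic: {aggregate_bps / 1e9:.1f} Gbps")
    print(f"node network ceiling:   {nic_ceiling_bps / 1e9:.1f} Gbps")
    print(f"utilization:            {aggregate_bps / nic_ceiling_bps:.0%}")

At rates like these the node is over its ceiling before you count replication and monitoring traffic, and "up to" bandwidth figures on smaller instances are burst figures, not sustained ones.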

The retry logic in the pipeline was masking the problem: transient timeouts caused by NIC saturation were being retried successfully most of the time, so the error rate was low. But the latency spikes were real, and the infrastructure was running inefficiently.
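If retries are going to absorb transient timeouts, they should at least be counted somewhere visible. A minimal sketch of the idea using redis-py and a placeholder metrics hook; the pipeline's actual retry code looked different:

    # Sketch: retry transient timeouts, but surface the retry count as a
    # metric so the masking shows up on a dashboard. emit_metric() stands in
    # for the pipeline's real metrics client.
    import time

    import redis

    client = redis.Redis(host="redis.internal", port=6379, socket_timeout=0.05)

    def emit_metric(name, value):
        print(f"metric {name}={value}")  # placeholder for a real metrics client

    def lookup_with_retry(key, attempts=3):
        for attempt in range(1, attempts + 1):
            try:
                return client.get(key)
            except redis.exceptions.TimeoutError:
                emit_metric("redis.lookup.retry", 1)  # the signal the retries were hiding
                if attempt == attempts:
                    raise
                time.sleep(0.01 * attempt)            # small backoff before retrying

A rising retry counter would have pointed at the saturation long before anyone went digging through latency histograms.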

Building the case

I had to make two arguments: one technical (here's what's happening and here's the fix), one financial (here's what it's costing and here's what we save).

The technical argument was straightforward once I had the NIC saturation diagnosis: migrate to network-optimized ElastiCache node types, and add read replicas to distribute the concurrent read load across multiple nodes rather than concentrating it on one.
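On the client side, spreading the reads is mostly a matter of pointing the hot path at the replicas. Assuming a cluster-mode-disabled replication group, ElastiCache exposes a primary endpoint for writes and a reader endpoint that spreads connections across the replicas; a sketch with placeholder hostnames:

    # Sketch: writes go to the primary endpoint, per-event lookups go to the
    # reader endpoint, which load-balances across the read replicas.
    # Hostnames are placeholders for the real ElastiCache endpoints.
    import redis

    writer = redis.Redis(host="iee-cache.example.cache.amazonaws.com", port=6379)
    reader = redis.Redis(host="iee-cache-ro.example.cache.amazonaws.com", port=6379)

    def enrich(event_key):
        # Hot path: high-volume lookups hit the replicas.
        return reader.get(event_key)

    def update_whitelist(key, value):
        # Infrequent updates still go to the primary.
        writer.set(key, value)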

The financial argument required working out the cost difference between current node types and the proposed configuration, then modeling the projected savings. The number came out to $2.3M/year, large enough that it warranted a direct conversation with the AWS ElastiCache service team rather than a support ticket.
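The model itself was not sophisticated: hourly rate times node count, annualized, for the current fleet versus the proposed one. A sketch with placeholder counts and rates, since the real figures are the sensitive part:

    # Sketch: annualized cost delta between the current fleet and a proposed
    # network-optimized fleet. Node counts and hourly rates are placeholders.
    HOURS_PER_YEAR = 24 * 365

    current = {"hourly_usd": 7.50, "nodes": 48}    # compute-heavy, network-starved
    proposed = {"hourly_usd": 3.90, "nodes": 30}   # network-optimized + read replicas

    def annual_cost(fleet):
        return fleet["hourly_usd"] * fleet["nodes"] * HOURS_PER_YEAR

    savings = annual_cost(current) - annual_cost(proposed)
    print(f"projected annual savings: ${savings:,.0f}")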

Working with AWS

The AWS ElastiCache team was genuinely helpful once I had a clear diagnosis and a specific ask. This is worth noting because 'work with AWS to reduce costs' sounds like a vague escalation. It wasn't. I came with: here is the observed behavior, here is my diagnosis of the cause, here is the configuration change I'm proposing, and here is the projected cost impact. That specificity made the conversation productive.

They confirmed the diagnosis and helped us validate the migration approach. We moved to network-optimized nodes and added read replicas. The latency spikes disappeared. The bill dropped.

$2.3M/year reduction. Throughput actually improved once the bottleneck was gone.

What I learned

The most important thing I learned from this project is that retry logic is a liability disguised as a feature. The retries were hiding a real problem. If I hadn't been reading the metrics carefully, the NIC saturation would have continued indefinitely. The pipeline 'worked,' customers weren't complaining, there was no incident. Just an invisible $2.3M/year tax on doing the right thing the wrong way.

Read your metrics. Not just the alert-triggering ones.
