Observability

Replacing $300K/month of Datadog with AWS Timestream

observability aws timestream spark securonix

In the first month after Securonix's Spark analytics applications scaled to production volume, the Datadog bill was $300K. That's not a typo. Annualized, it was on track for $3.6M/year. Just for metrics and dashboards.

The standard response to a SaaS bill this size is to negotiate pricing, reduce log volume, or accept it as the cost of doing business. I had a different take: we were paying Datadog to store and query time-series data at scale. AWS Timestream does exactly that at a fraction of the cost. The observability value didn't have to go away. The vendor did.

What Datadog was actually doing for us

It's worth being precise about this, because replacing an observability tool is only a win if you maintain observability fidelity; otherwise you've traded a cost problem for a visibility problem.

We were using Datadog for: custom metrics from Spark executors (throughput, error rates, record processing rates), consumer lag monitoring, processing latency histograms, and Grafana-style dashboards for the operations team. These weren't vanity metrics. The ops team used them daily for incident detection and pipeline debugging.

The replacement stack

Custom metric collectors in Spark

Rather than using Datadog's agent, I instrumented the Spark applications directly. Each executor emits structured metrics (throughput, error counts, consumer lag, processing latency) to a Kinesis stream. This gave us control over exactly what was being emitted and how, rather than depending on Datadog's agent to capture the right signals.
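A minimal sketch of what an executor-side emitter can look like. The stream name, metric names, and JSON schema here are illustrative, not the production ones:

```python
import json
import time


def build_metric_record(metric_name, value, tags):
    """Serialize one metric sample as a JSON Kinesis record.

    Returns (data_bytes, partition_key). Partitioning by metric name
    keeps samples for the same series on the same shard, in order.
    """
    payload = {
        "metric": metric_name,
        "value": value,
        "tags": tags,  # e.g. {"app": "parser", "executor": "7"}
        "timestamp_ms": int(time.time() * 1000),
    }
    return json.dumps(payload).encode("utf-8"), metric_name


def emit_metric(kinesis_client, stream_name, metric_name, value, tags):
    """Ship one sample to Kinesis from executor-side code.

    `kinesis_client` is a boto3.client("kinesis") instance; injecting
    it keeps the serialization logic testable without AWS access.
    """
    data, partition_key = build_metric_record(metric_name, value, tags)
    kinesis_client.put_record(
        StreamName=stream_name,
        Data=data,
        PartitionKey=partition_key,
    )


# Example (requires AWS credentials and a real stream):
# import boto3
# emit_metric(boto3.client("kinesis"), "spark-metrics",
#             "records_processed", 125000, {"app": "parser"})
```

Batching with `put_records` is the obvious next step at real executor volume; the single-record call keeps the sketch short.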

AWS Timestream as the backend

Timestream is a managed time-series database with two storage tiers: a memory store for recent data (fast reads, higher cost) and a magnetic store for historical data (slower reads, much lower cost). Data automatically transitions between tiers based on a configurable retention policy. Our use case (fast access to the last 24 hours, slower access to the last 30 days) mapped onto this model exactly.
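Retention tiering is configured per table. A sketch with boto3, assuming a database and table named `spark_metrics` / `executor_metrics` (illustrative names), with 24 hours in the memory store and 30 days in the magnetic store:

```python
def retention_properties(memory_hours, magnetic_days):
    """Timestream retention config: recent data lives in the memory
    store, then transitions to the cheaper magnetic store."""
    return {
        "MemoryStoreRetentionPeriodInHours": memory_hours,
        "MagneticStoreRetentionPeriodInDays": magnetic_days,
    }


def create_metrics_table(client, database, table):
    """Create a table with the 24h / 30d tiering described above.

    `client` is a boto3.client("timestream-write") instance.
    """
    client.create_table(
        DatabaseName=database,
        TableName=table,
        RetentionProperties=retention_properties(24, 30),
    )


# Example (requires AWS credentials):
# import boto3
# create_metrics_table(boto3.client("timestream-write"),
#                      "spark_metrics", "executor_metrics")
```

The same `RetentionProperties` shape works with `update_table` if the windows need tuning later.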

The query interface is SQL-based, which made it easy for the team to write ad-hoc queries for debugging without learning a new query language.
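For instance, a per-minute consumer-lag rollup over the trailing hour might look like this (the table and measure names are illustrative; `bin` and `ago` are Timestream SQL built-ins):

```python
# A typical ad-hoc debugging query: average consumer lag per minute
# over the last hour.
CONSUMER_LAG_QUERY = """
SELECT bin(time, 1m) AS minute,
       avg(measure_value::double) AS avg_lag
FROM "spark_metrics"."executor_metrics"
WHERE measure_name = 'consumer_lag'
  AND time > ago(1h)
GROUP BY bin(time, 1m)
ORDER BY minute
"""


def run_query(client, query=CONSUMER_LAG_QUERY):
    """Run a Timestream query, following pagination.

    `client` is a boto3.client("timestream-query") instance.
    """
    rows, next_token = [], None
    while True:
        kwargs = {"QueryString": query}
        if next_token:
            kwargs["NextToken"] = next_token
        page = client.query(**kwargs)
        rows.extend(page["Rows"])
        next_token = page.get("NextToken")
        if not next_token:
            return rows
```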

Grafana connected to Timestream

Grafana has a native Timestream data source plugin. I rebuilt the dashboards the ops team depended on (pipeline health, consumer lag, error rates) with the same granularity and alert thresholds as the Datadog equivalents. The transition was invisible to the team from a workflow perspective.

The result

$300K/month Datadog spend replaced. Zero reduction in observability coverage. The team retained all critical metrics, dashboards, and alerting.

The non-obvious tradeoffs

Datadog has features Timestream doesn't: distributed tracing, log aggregation, APM. We weren't using those features at the scale that justified the cost, but if we had been, the math would have been different. This replacement made sense for our specific workload: high-volume custom metrics from Spark applications. It wouldn't be the right call for a team that's deeply invested in Datadog's tracing and log correlation features.

The other tradeoff: Timestream requires more operational ownership than Datadog. Datadog handles retention, scaling, and querying for you. With Timestream, you manage retention policies, configure tiering, and write your own queries. For a team with the engineering capacity to maintain this, it's fine. For a smaller team, the operational overhead might outweigh the cost savings.
