“We’re killing Datadog.”
The VP of Engineering said it three months ago. Series C startup, planning meeting. Moving everything to Grafana Cloud. Room full of nods. Finally.
I checked in last week. They now have Datadog, Grafana Cloud, and somehow picked up New Relic along the way.
Budget went up 40%.
Everyone’s consolidating. Nobody finishes.
Every team I’ve worked with in the past two years has been “actively consolidating.” Every single one still runs multiple observability stacks.
Tool sprawl isn’t a technical problem. It’s a people problem. And you can’t fix people problems with a migration plan.
The sunk cost trap
“We already invested 6 months setting up Datadog.”
I hear this in every consolidation meeting. Some engineer spent half a year wiring up dashboards. Custom queries. Team knowledge baked into panel layouts nobody documented.
Nobody wants to throw that away.
But that 6 months is gone. The only question: what’s the right tool going forward?
That’s not how brains work, though. You keep paying Datadog because you already paid Datadog. Classic sunk cost.
And here’s the dark part: you’ll spend another 6 months on the “new” tool anyway. Learning curve doesn’t disappear. You’re just resetting the clock while running two bills.
Team politics nobody admits
Slack thread from last week:
#platform-eng
Sarah: What are we using for tracing again?
Mike: Datadog APM
Chen: I thought we moved to Honeycomb?
Mike: Backend did. Frontend uses Sentry.
DevOps: Prometheus, obviously. It's free.
Sarah: ...so all of them?
Everyone has valid reasons. Everyone’s optimizing for their own workflow. Nobody’s optimizing for the company.
I watched a 50-person engineering org run four observability stacks. Each team had their favorite. Leadership let it slide because “we trust our engineers.”
Then an incident crossed three teams. Each one debugging in their own tool. No shared context. No common trace ID. Three groups shouting different metrics at each other in Slack.
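The cheapest partial fix I’ve seen is stamping one shared correlation ID on every request at the edge, so that even if teams keep their own tools, they’re all searching on the same key. A minimal sketch, assuming a plain header-dict middleware — the `X-Trace-Id` header name and `ensure_trace_id` helper are mine, not from this incident (real setups would use the W3C `traceparent` header via OpenTelemetry):

```python
import secrets

TRACE_HEADER = "X-Trace-Id"  # hypothetical name; W3C Trace Context uses `traceparent`

def ensure_trace_id(headers: dict) -> dict:
    """Attach a trace ID at the edge if the caller didn't send one.

    Every downstream service forwards the same header unchanged, so
    Datadog, Honeycomb, and Sentry events can all be filtered by one key.
    """
    if TRACE_HEADER not in headers:
        # 16 random bytes, hex-encoded: same width as a W3C trace ID
        headers[TRACE_HEADER] = secrets.token_hex(16)
    return headers

# The edge service generates the ID; internal hops just pass it through.
incoming = ensure_trace_id({"Host": "api.example.com"})
forwarded = ensure_trace_id(dict(incoming))  # downstream call: ID preserved
assert incoming[TRACE_HEADER] == forwarded[TRACE_HEADER]
```

This doesn’t consolidate anything, but it turns “three groups shouting different metrics” into three groups at least looking at the same request.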
Took 4 hours. Should’ve been 20 minutes.
The “what if” hoarding problem
“What if we need that data later?” “What if the new tool doesn’t have this feature?” “What if we’re wrong?”
So you keep everything. Just in case.
Series B company I worked with had $8k/month in “just in case” tools. Tools nobody logged into. Tools for problems that might exist someday.
They cut them all. Nothing broke. Nobody noticed.
That “what if” fear? $96k a year, in their case.
Consolidation becomes the problem
This is my favorite pattern.
Teams start consolidating to save money. They pick Grafana Cloud. Migration takes longer than expected. Meanwhile, a new team spins up and needs monitoring now. They grab Datadog because it’s faster.
Now you’re running both. Plus whatever New Relic trial someone forgot to cancel.
The consolidation project created more sprawl than it fixed.
You didn’t avoid lock-in. You tripled it. Your Datadog dashboards can’t migrate. Your Honeycomb queries use custom syntax. Your Sentry integrations need to be rebuilt from scratch.
Pick one. Go deep. If you switch later, at least you’re only switching once.
The cost nobody tracks
The tools cost money. $50k, $100k, $200k depending on scale.
But that’s not the real cost.
Your on-call engineer gets paged at 2am. Opens Datadog. Checks APM. Nothing obvious. Opens Sentry. Checks errors. Maybe related. Opens CloudWatch. Checks Lambda. Opens Honeycomb for traces.
Each tool switch: 2-3 minutes of mental reset. Re-authenticating context. Building a mental model from scratch in a different UI.
Four tools. Four switches. 10 minutes of pure waste before debugging even starts.
At 3am, with customers screaming, those 10 minutes feel like an hour.
What actually works
Stop treating this as a tool decision. It’s a leadership decision.
Pick an owner. Not a committee. One person with authority to say “we’re using X, period.”
Set a deadline. Not “when we have time.” Pick a date. Anything on the old tool after that date gets turned off.
Accept the loss. Those dashboards someone spent 6 months on? Thank them. Delete the dashboards. Move on.
Measure what matters. MTTR before and after. Context switches per incident. Time to first meaningful data.
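Those metrics are computable from whatever incident records you already export from your pager. A rough sketch, assuming each incident is a dict with `opened_at`, `resolved_at`, and an ordered `tools_used` list — the field names and shape are my invention:

```python
from datetime import datetime
from statistics import mean

def mttr_minutes(incidents):
    """Mean time to resolution, in minutes, across a list of incidents."""
    durations = [
        (i["resolved_at"] - i["opened_at"]).total_seconds() / 60
        for i in incidents
    ]
    return mean(durations)

def context_switches(incident):
    """Count tool-to-tool hops during one incident, e.g. Datadog -> Sentry."""
    tools = incident["tools_used"]
    return sum(1 for a, b in zip(tools, tools[1:]) if a != b)

incidents = [
    {   # the 4-hour, four-tool kind
        "opened_at": datetime(2024, 5, 1, 2, 0),
        "resolved_at": datetime(2024, 5, 1, 6, 0),
        "tools_used": ["Datadog", "Sentry", "CloudWatch", "Honeycomb"],
    },
    {   # the 20-minute, one-tool kind
        "opened_at": datetime(2024, 5, 8, 14, 0),
        "resolved_at": datetime(2024, 5, 8, 14, 20),
        "tools_used": ["Grafana"],
    },
]

print(mttr_minutes(incidents))         # 130.0 minutes
print(context_switches(incidents[0]))  # 3 hops before debugging even starts
```

Run it before the migration and again after. If the switch count per incident isn’t falling, the consolidation is theater.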
I’ve seen this work exactly twice. Both times, a VP personally drove the migration. Made it their OKR. Took the heat when teams complained.
Every other attempt? Still running four tools.
The tooling matters less than you think. What matters is whether your team can debug an incident in one place, with one mental model, in one UI.
Everything else is expensive noise.
I wrote more about the hidden costs of observability chaos here.
— Youn