Here’s a stat that should bother you: 82% of companies now take over an hour to resolve production incidents. That number was lower five years ago.
We have better tools. More dashboards. Fancier AI assistants. And we’re getting slower.
I’ve been thinking about why.
The tool explosion
Last month I audited a mid-size fintech’s observability stack. They had:
- Datadog for APM
- Prometheus for metrics
- Loki for logs
- Jaeger for traces
- PagerDuty for alerts
- Slack for communication
- Notion for runbooks
- A custom dashboard “because Datadog was too slow”
Eight tools. Eight browser tabs. Eight different query languages.
When an incident hits, their on-call engineer has to:
- Check PagerDuty for the alert
- Open Datadog to see APM data
- Correlate with Prometheus metrics
- Search Loki for relevant logs
- Find the trace in Jaeger
- Paste links into Slack
- Look up the runbook in Notion
By step 4, they’ve lost the thread. I watched it happen in real time during a shadow session. The engineer was competent. The tooling made them slow.
Context switching is the killer
Every tool hop is an interruption, and research on task switching puts the cost of refocusing after an interruption at over 20 minutes. That’s not my number; it comes from Gloria Mark’s work at UC Irvine. And during an incident, you’re hopping constantly.
The irony is brutal. We added all these tools to see more. Instead, we see less because we can’t hold it all in our heads.
More dashboards, less understanding
I’ve seen teams with 200+ Grafana dashboards. When I ask which ones they actually use during incidents, they point to maybe three. The rest are a dashboard graveyard: created once for a demo, never deleted.
# This is what most alerting configs look like
groups:
  - name: everything-alerts
    rules:
      - alert: CPUHigh
        expr: cpu > 80
      - alert: MemoryHigh
        expr: memory > 80
      - alert: DiskHigh
        expr: disk > 80
      # ... 47 more alerts nobody reads
When everything alerts, nothing alerts. Your on-call just learns to ignore the noise.
What actually helps
The teams with fast MTTR (mean time to resolution) share a few patterns.
One observability platform. I don’t care which one. Pick something that does metrics, logs, and traces in one place. The correlation alone will save you hours per incident.
Business-logic traces. Stop tracing HTTP calls. Start tracing what users are trying to do. “User tried to checkout and failed at payment validation” is infinitely more useful than “POST /api/v2/checkout returned 500.”
Fewer, better alerts. An alert should mean “drop everything and look at this.” If it doesn’t, delete it. I’ve helped teams go from 200 alerts to 15. Their MTTR dropped by 60%.
Runbooks that are actually runnable. Not documentation. Actual steps. With commands. That someone updated in the last six months.
The hard truth
Your observability stack is probably making you slower. Not because the tools are bad, but because you have too many of them and not enough thought about how they fit together.
The fix isn’t adding another tool. It’s removing three.
If you’re fighting this right now, my OTel pitfalls guide covers the instrumentation side of unified observability.
— Youn