Here’s a stat that should bother you: 82% of companies now take over an hour to resolve production incidents. That number was lower five years ago.
We have better tools. More dashboards. Fancier AI assistants. And we’re getting slower.
I’ve been thinking about why.
The tool explosion
Last month I audited a mid-size fintech’s observability stack. They had:
- Datadog for APM
- Prometheus for metrics
- Loki for logs
- Jaeger for traces
- PagerDuty for alerts
- Slack for communication
- Notion for runbooks
- A custom dashboard “because Datadog was too slow”
Eight tools. Eight browser tabs. Eight different query languages.
When an incident hits, their on-call engineer has to:
- Check PagerDuty for the alert
- Open Datadog to see APM data
- Correlate with Prometheus metrics
- Search Loki for relevant logs
- Find the trace in Jaeger
- Paste links into Slack
- Look up the runbook in Notion
By step 4, they’ve lost the thread. I watched it happen in real time during a shadow session. The engineer was competent. The tooling made them slow.
Context switching is the killer
Every tool hop is an interruption, and research on task switching puts the cost of refocusing after an interruption at over 20 minutes. That’s not my number; it comes from Gloria Mark’s work at UC Irvine. And during an incident, you’re hopping constantly.
The irony is brutal. We added all these tools to see more. Instead, we see less because we can’t hold it all in our heads.
More dashboards, less understanding
I’ve seen teams with 200+ Grafana dashboards. When I ask which ones they actually use during incidents, they point to maybe three. The rest are a dashboard graveyard: created once for a demo, never deleted.
# This is what most alerting configs look like
groups:
  - name: everything-alerts
    rules:
      - alert: CPUHigh
        expr: cpu > 80
      - alert: MemoryHigh
        expr: memory > 80
      - alert: DiskHigh
        expr: disk > 80
      # ... 47 more alerts nobody reads
When everything alerts, nothing alerts. Your on-call just learns to ignore the noise.
What actually helps
The teams with fast MTTR (mean time to resolution) share a few patterns.
One observability platform. I don’t care which one. Pick something that does metrics, logs, and traces in one place. The correlation alone will save you hours per incident.
Business-logic traces. Stop tracing HTTP calls. Start tracing what users are trying to do. “User tried to checkout and failed at payment validation” is infinitely more useful than “POST /api/v2/checkout returned 500.”
Fewer, better alerts. An alert should mean “drop everything and look at this.” If it doesn’t, delete it. I’ve helped teams go from 200 alerts to 15. Their MTTR dropped by 60%.
Runbooks that are actually runnable. Not documentation. Actual steps. With commands. That someone updated in the last six months.
The hard truth
Your observability stack is probably making you slower. Not because the tools are bad, but because you have too many of them and not enough thought about how they fit together.
The fix isn’t adding another tool. It’s removing three.
If you’re fighting this right now, my OTel pitfalls guide covers the instrumentation side of unified observability.
— Youn