Got paged at 3am last Tuesday. Disk usage alert on a dev server nobody uses anymore.

That’s the third time this month. Different alert, same story. Noise.

I talked to the on-call engineer afterward. She said she’d already silenced 40+ alerts that shift. Just to stay sane.

That’s not monitoring. That’s spam.

The numbers are brutal

82% of organizations now have MTTR over an hour. That’s not my number. That’s from the 2026 State of Observability report.

An hour. To acknowledge, investigate, and fix.

We have more tools than ever. More data than ever. And we’re slower than ever.

Something broke.

How we got here

Around 2020, the advice was “collect everything.” Storage is cheap. You never know what you’ll need. Instrument all the things.

So everyone did.

# The 2020 approach
scrape_configs:
  - job_name: 'everything'
    static_configs:
      - targets: ['*:9090']  # Yep, all of it (globs don't even work here, but that was the spirit)

# And in the rules file, the shape was:
groups:
  - name: everything
    rules:
      - alert: CPUHigh
        expr: cpu > 80  # On every single host
      - alert: MemoryHigh
        expr: memory > 80  # Every host again
      - alert: DiskHigh
        expr: disk > 80  # You see where this is going

Now multiply that by 500 services. Add some auto-scaling. Throw in a Kubernetes cluster or three.

Congratulations. You’ve built an alert cannon pointed at your on-call’s phone.
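To make the multiplication concrete, here's the back-of-the-envelope version. The fleet numbers are made up for illustration; plug in your own and it only gets worse.

```python
# Back-of-the-envelope alert math. All numbers are hypothetical.
rules_per_host = 3        # CPUHigh, MemoryHigh, DiskHigh
services = 500
hosts_per_service = 4     # a couple of replicas plus autoscaling headroom

alert_rules = rules_per_host * services * hosts_per_service
print(alert_rules)        # 6000 individual things that can page

# If just 1% of them fire on a bad night:
pages = alert_rules * 0.01
print(pages)              # 60.0
```

Sixty pages in one shift is exactly the "silenced 40+ alerts to stay sane" story from the top.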

The real problem

Most alerts aren’t actionable. They’re FYI at best. Noise at worst.

CPU at 85%? Cool. Is it affecting users? No? Then why did you page someone?

Latency spike to 200ms? On what endpoint? Does anyone care about that endpoint?

Error rate at 0.1%? Out of how many requests? During a traffic spike? Is it actually a problem?

None of this context makes it into the page. Just “thing happened, good luck.”

What actually fixes this

I’ve helped a few teams dig out of alert hell. Same pattern every time.

Kill the threshold alerts. Use SLOs instead.

Don’t alert on CPU at 80%. Alert when your error budget is burning too fast.

# SLO-based alerting (rules file)
groups:
  - name: slo-alerts
    rules:
      - alert: ErrorBudgetBurnRate
        expr: |
          sum(rate(http_requests_total{status=~"5.."}[1h]))
          /
          sum(rate(http_requests_total[1h]))
          >
          (1 - 0.999) * 14.4  # 14.4x burn = paging: 2% of a 30-day budget gone in an hour
        labels:
          severity: page
        annotations:
          summary: "Burning error budget too fast"

Now you’re alerting on user impact. Not infrastructure trivia.
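If 14.4 looks like a magic number: burn rate is just the observed error ratio divided by your budget, and a burn rate of 1 means you spend the budget exactly over the SLO window. A quick sketch of the arithmetic, assuming a 99.9% SLO over a 30-day window (the function names here are mine, not anything from Prometheus):

```python
# Burn-rate arithmetic for a 99.9% SLO over a 30-day window.
# Mirrors the threshold in the alert rule above; everything else is
# just the definition of burn rate.
slo = 0.999
budget = 1 - slo                 # 0.1% of requests may fail
window_hours = 30 * 24           # 720 hours

def burn_rate(error_ratio: float) -> float:
    """How many times faster than 'sustainable' we're burning budget."""
    return error_ratio / budget

def budget_spent(error_ratio: float, hours: float) -> float:
    """Fraction of the 30-day budget consumed at this rate for `hours`."""
    return burn_rate(error_ratio) * hours / window_hours

# The paging threshold from the rule: 1.44% errors = 14.4x burn.
print(burn_rate(0.0144))              # ~14.4
# Sustained for one hour, that eats 2% of the month's budget...
print(budget_spent(0.0144, 1))        # ~0.02
# ...and exhausts it entirely in about two days.
print(window_hours / burn_rate(0.0144))  # ~50 hours to zero budget
```

That's why it pages: at that rate the month's budget is gone before the weekend, so a human needs to look now.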

Ask “what would the on-call do?” before creating any alert.

If the answer is “look at it, probably nothing,” delete the alert. Make it a dashboard instead. Or a weekly report. Or nothing.

Alerts are for action. Everything else is a log.

Business outcomes, not system metrics.

Page me when checkout is failing. Not when a database is using 80% memory. One is a problem. The other is a server doing its job.
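As a sketch of the distinction (the metric names and the 99.5% floor are invented for illustration):

```python
# Symptom-based paging, sketched. Values are hypothetical; in real
# life they come from your metrics backend.

def should_page(checkout_success_rate: float, db_memory_pct: float) -> bool:
    """Page on user impact only. Infra stats are context, not triggers."""
    # 99.5% of checkouts succeeding is the (hypothetical) SLO floor.
    users_hurting = checkout_success_rate < 0.995
    # Note what's absent: db_memory_pct never decides the page.
    return users_hurting

# Database at 80% memory, checkouts healthy: nobody wakes up.
print(should_page(checkout_success_rate=0.999, db_memory_pct=80))  # False
# Checkouts failing: page, no matter how "fine" the infra looks.
print(should_page(checkout_success_rate=0.97, db_memory_pct=40))   # True
```

The memory number still belongs on a dashboard for whoever answers the page. It just never gets to fire the page itself.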

The hard conversation

Most teams have 10x the alerts they need. But nobody wants to delete them. “What if we miss something?”

You’re already missing things. You’re missing them in a flood of noise.

Delete 90% of your alerts. See what happens. I promise, you’ll sleep better. And your MTTR will drop.


More on getting observability right in my guide on OTel pitfalls. The alerting section especially.

— Youn