Got paged at 3 AM last week. Payment service CPU spiked to 92%.

I rolled out of bed. Opened my laptop. Stared at Grafana for twenty minutes. Everything was fine. Checkout was working. Orders were processing. Customers were happy.

The CPU spike? Garbage collection. Totally normal. Lasted about 90 seconds.

I lost an hour of sleep for nothing.

This happens everywhere

I’ve worked with maybe thirty companies at this point. Same story at all of them.

Alert fires. Engineer wakes up. Engineer investigates. Turns out it’s nothing. Engineer goes back to bed angry. Repeat.

Most alerts are garbage. Not some. Most.

CPU thresholds. Memory usage. Disk space warnings. Pod restarts. None of these tell you if users are actually affected.

They’re infrastructure proxies. Guesses. “If CPU is high, something might be wrong.” Maybe. Or maybe your server is just doing its job.

The actual problem

We alert on system behavior instead of business outcomes.

Think about what actually matters. At an e-commerce company, it’s “can people buy things?” At a SaaS, it’s “can people use the product?”

But we don’t alert on that. We alert on databases and containers and queues. Then we try to infer whether users are affected.

That’s backwards.

Here’s what the bad version looks like:

- alert: PaymentServiceCPUHigh
  expr: rate(container_cpu_usage_seconds_total{pod=~"payment-.*"}[5m]) > 0.8
  for: 5m
  labels:
    severity: page

This tells you nothing. CPU is high. So what? Is checkout broken?

The fix is obvious once you see it

Alert on outcomes. Not proxies.

- alert: CheckoutSuccessRateLow
  expr: |
    sum(rate(checkout_completed_total[5m]))
    /
    sum(rate(checkout_started_total[5m]))
    < 0.99
  for: 2m
  labels:
    severity: page
  annotations:
    summary: "Checkout success rate dropped below 99%"

Now you’re alerting on something that matters. Checkout is failing. Wake someone up.

If checkout is fine but CPU is at 95%, that’s not a page. That’s a ticket for tomorrow. Or maybe nothing at all.
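In Alertmanager terms, that split is just routing on the severity label: pages go to a human, everything else goes to a queue. A sketch of what that could look like (the receiver names, routing key, and webhook URL here are made-up placeholders, not anyone's real config):

route:
  receiver: ticket-queue            # default: file it, look at it tomorrow
  routes:
    - matchers:
        - severity = "page"
      receiver: oncall-pager        # only outcome-level alerts land here
receivers:
  - name: oncall-pager
    pagerduty_configs:
      - routing_key: REPLACE_ME     # placeholder, not a real key
  - name: ticket-queue
    webhook_configs:
      - url: https://tickets.example.com/hook   # placeholder endpoint

The point of the structure: the default path is the quiet one. Paging is the exception you opt into per alert, not the other way around.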

SLOs make this concrete

The cleanest version of this is SLO-based alerting. Define what “good” looks like for your users. Measure it. Alert when you’re burning through your error budget too fast.
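A fast-burn page in that style, reusing the checkout metrics from earlier, might look like the sketch below. The 99.9% target is an assumption for illustration, and the 14.4x multiplier comes from the multi-window, multi-burn-rate recipe in the Google SRE workbook (it pages when you'd exhaust a 30-day budget in about two days):

- alert: CheckoutErrorBudgetFastBurn
  expr: |
    (
      1 - sum(rate(checkout_completed_total[1h]))
          / sum(rate(checkout_started_total[1h]))
    ) > (14.4 * 0.001)
    and
    (
      1 - sum(rate(checkout_completed_total[5m]))
          / sum(rate(checkout_started_total[5m]))
    ) > (14.4 * 0.001)
  for: 2m
  labels:
    severity: page

The two windows are the trick: the 1h window proves the burn is sustained, the 5m window proves it's still happening right now. Either one alone pages on noise.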

I worked with a team last year that had 200+ alerts. After we switched to SLO-based alerting, they had 12. Twelve.

Their MTTR dropped by 60%. Not because they got faster at responding. Because they stopped responding to noise.

The hard part

Deleting alerts feels scary. “What if we miss something real?”

Here’s the thing. You’re already missing things. You’re missing them because your on-call is numb from all the false positives.

Every garbage alert trains people to ignore alerts. That’s way more dangerous than having fewer alerts.

Start with one service. Pick your most critical user journey. Define success. Alert only when that breaks. Watch what happens.
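A concrete way to do "define success" is a recording rule that names the SLI once, so every alert and dashboard reads the same number. Metric names are reused from the example above; the rule name is my own convention, not a standard:

groups:
  - name: checkout-slo
    rules:
      - record: checkout:success_ratio:rate5m
        expr: |
          sum(rate(checkout_completed_total[5m]))
          /
          sum(rate(checkout_started_total[5m]))

Then your alerts compare checkout:success_ratio:rate5m against the target, and nobody argues about whose query is right.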

You’ll sleep better. I promise.


More on building alerts that don’t suck in my reliability cost breakdown.

— Youn