I was on a call last week with a Series C company. Their monitoring setup looked solid. Datadog, Prometheus, the works. All green.

Meanwhile, their support queue had 40 tickets about failed checkouts.

The problem nobody talks about

When you set up monitoring, you’re usually checking:

  • Is the server responding?
  • Is the latency acceptable?
  • Is the error rate low?

Cool. But none of that tells you whether the checkout actually worked.

Here’s what I mean. Your payment service can return 200 OK with this body:

{
  "status": "failed",
  "reason": "card_declined"
}

Your monitoring sees: successful request, 200ms latency, no errors. Your customer sees: broken checkout, rage, churn.
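One cheap mitigation: make your synthetic check parse the response body instead of trusting the status code. A minimal sketch, assuming a JSON payload shaped like the example above (the `status` field and its values are from that example, not from any real API):

```python
import json

def checkout_probe_ok(status_code, body):
    """Return True only if the HTTP status AND the payload agree.

    A 200 with {"status": "failed"} is treated as a failure --
    exactly the case a status-code-only monitor misses.
    """
    if status_code != 200:
        return False
    try:
        payload = json.loads(body)
    except json.JSONDecodeError:
        return False  # a 200 with an unparseable body is not healthy
    return payload.get("status") == "success"

# The example from above: green to the load balancer, broken to the customer.
print(checkout_probe_ok(200, '{"status": "failed", "reason": "card_declined"}'))  # False
print(checkout_probe_ok(200, '{"status": "success"}'))  # True
```

Most synthetic-monitoring tools can do some version of this with a body assertion; the point is that the assertion has to exist.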

The gap

This is what I call the “green dashboard gap”: the distance between what your infrastructure metrics say and what your business is actually doing.

Most monitoring tools are built for infrastructure. They answer “is the server up?” but not “did the user accomplish their goal?”

What actually works

You need to instrument business outcomes, not just HTTP calls.

Instead of:

trace: POST /api/checkout -> 200 OK (180ms)

You want:

trace: checkout.attempt
  -> payment.process: success
  -> inventory.reserve: success
  -> email.confirmation: sent
  -> checkout.complete: true

Now when something breaks, you see which step failed. Not just “the API returned 200.”
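The trace above can be sketched in plain code. This is a stdlib-only illustration of the shape, not a real implementation — in production you’d emit each step as an OpenTelemetry span or span event; the step names mirror the trace, and the step callables are stand-ins:

```python
def run_checkout(steps):
    """Run named steps in order, recording each step's business outcome.

    `steps` is a list of (name, callable) pairs. Each callable returns a
    status string ("success", "sent", ...) or raises on failure.
    """
    trace = {"checkout.attempt": True}
    for name, step in steps:
        try:
            trace[name] = step()
        except Exception as exc:
            trace[name] = f"error: {exc}"
            trace["checkout.complete"] = False
            return trace  # stop at the first failed step -- that's the signal
    trace["checkout.complete"] = True
    return trace

# Stand-in step implementations, for illustration only.
trace = run_checkout([
    ("payment.process", lambda: "success"),
    ("inventory.reserve", lambda: "success"),
    ("email.confirmation", lambda: "sent"),
])
print(trace["checkout.complete"])  # True
```

The useful property: when `payment.process` raises, the trace records which step died and marks the whole checkout incomplete, instead of reporting a happy 200.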

The hard part

This means touching application code. You can’t just install an agent and call it done. Every critical flow needs explicit instrumentation.
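One way to keep the per-flow cost down — a sketch under assumptions, not a prescription: wrap each critical step once with a decorator that records its outcome, so the flow code itself stays readable. `OUTCOMES` and `process_payment` here are hypothetical stand-ins; in a real system the sink would be a span or metric emitter:

```python
import functools

OUTCOMES = {}  # stand-in sink; in production, a span/metric emitter

def record_outcome(step_name):
    """Decorator: record success/failure of a business step under `step_name`."""
    def wrap(fn):
        @functools.wraps(fn)
        def inner(*args, **kwargs):
            try:
                result = fn(*args, **kwargs)
                OUTCOMES[step_name] = "success"
                return result
            except Exception:
                OUTCOMES[step_name] = "failed"
                raise  # instrumentation observes; it never swallows errors
        return inner
    return wrap

@record_outcome("payment.process")
def process_payment(amount):
    # Hypothetical stand-in for your real payment call.
    return {"charged": amount}

process_payment(42)
print(OUTCOMES)  # {'payment.process': 'success'}
```

It’s still explicit work — every critical step needs the decorator — but it turns “instrument this flow” into one line per step instead of a rewrite.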

Most teams don’t have time for this. Or they try, get 60% coverage, and still miss the flows that actually matter.

That’s usually when I get the call.


If this sounds familiar, I wrote a technical guide on OTel pitfalls that goes deeper into the implementation side.

— Youn