I was in a war room last month. Major outage at a Series C fintech.
Eight engineers. Four hours. All of them senior. The CTO was there too.
Everyone had their laptop open. Everyone was looking at their own service. Nobody could see the full picture.
“Database looks fine on my end.” “Auth service is green.” “Payments is healthy.”
Meanwhile, users couldn’t check out. For four hours.
Let’s do the math
Average senior engineer at a scaling startup: $180k to $220k salary. Let’s say $200k.
That’s about $100 an hour; fully loaded, it’s probably closer to $140.
Eight engineers for four hours is 32 engineer-hours. That’s $3,200 at base salary, about $4,500 fully loaded. Count the CTO and it’s closer to $6,000.
One incident. One afternoon.
Now here’s the thing. This wasn’t a rare event. This was the third major incident that month. They have one or two every week.
Run the numbers: four to eight incidents a month, at $4,500 to $6,000 each. Call it $20,000 to $25,000 per month, conservatively. Just in war room time. Just in salary costs.
Not counting lost revenue. Not counting customer churn. Not counting the burnout that makes your best people leave.
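The arithmetic above, as a quick sketch. The rates and incident count are the rough figures from this post, not measured data:

```python
# Back-of-envelope war room cost, using the rough figures above.
HOURLY_BASE = 200_000 / 2_000      # ~$100/hr on a $200k salary
HOURLY_LOADED = 1.4 * HOURLY_BASE  # ~$140/hr fully loaded

engineers = 8
hours = 4

per_incident = engineers * hours * HOURLY_LOADED   # ~$4,480
incidents_per_month = 5                            # "one or two every week"
monthly = per_incident * incidents_per_month       # ~$22,400

print(f"per incident: ${per_incident:,.0f}")
print(f"per month:    ${monthly:,.0f}")
```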
Why war rooms fail
Everyone’s looking at their own service.
The database team checks the database. It’s fine. The auth team checks auth. Also fine. Payments team, same story. Every individual piece looks healthy.
But the user’s checkout is broken. And nobody can see why.
Because the problem isn’t in any single service. It’s in the interaction between them. It’s in the cascade. Service A times out waiting for Service B, which is slow because Service C is backed up.
You can’t see that from individual dashboards. You need to follow the request.
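Here’s a toy version of that cascade. The latencies and retry counts are made up for illustration, not taken from the incident:

```python
# Each service looks healthy in isolation, but retries multiply latency.
C_LATENCY_MS = 800        # Service C: slow, but within its own SLO
B_RETRIES = 3             # Service B retries C on timeout
A_TIMEOUT_MS = 2_000      # Service A gives up after 2 seconds

# What B actually spends waiting on C across its retries:
b_total_ms = B_RETRIES * C_LATENCY_MS   # 2,400 ms

# A's view: the request blows past its timeout, even though
# no single call to C took more than a second.
a_times_out = b_total_ms > A_TIMEOUT_MS
print(a_times_out)  # True
```

Every dashboard shows a green service; only the end-to-end view shows the timeout.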
The correlation problem
Most observability setups are siloed. Each team has their own dashboards, their own alerts, their own tools.
When something breaks, everyone retreats to their silo. “Not my service.” Everyone’s right. Everyone’s also useless.
The fix is correlation. One trace that shows the entire user journey. One view that connects all the dots.
When I pulled up distributed tracing at that fintech, the problem was obvious in 30 seconds. A retry loop between two services was creating cascading timeouts. Neither team could see it from their own metrics. But in the trace, it was right there.
Four hours of war room. 30 seconds with the right tool.
What distributed tracing actually means
Not just “we have Jaeger.” That’s not enough.
I mean traces that show business context. Not just “POST /api/v1/checkout” but “user 12345 attempting to purchase 3 items for $127.”
I mean traces that span all services. Not just the ones you remembered to instrument. Every hop, every queue, every database call.
I mean traces that are easy to find. When support reports “checkout failed for user X at 2:47pm,” any engineer should be able to pull that exact trace in under a minute.
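To make the idea concrete, here’s a toy model of those three properties in plain Python. It is a sketch of the concept, not a real tracing backend; the attribute keys (`user.id`, `cart.total_usd`) and span names are illustrative, in the style of OpenTelemetry conventions:

```python
# Toy model of traces with business context: each span carries
# attributes, and any engineer can pull a trace by user and time.
from dataclasses import dataclass, field

@dataclass
class Span:
    trace_id: str
    name: str
    timestamp: str                     # wall-clock time, e.g. "14:47"
    attributes: dict = field(default_factory=dict)

spans = [
    Span("t-1", "POST /api/v1/checkout", "14:47",
         {"user.id": "12345", "cart.items": 3, "cart.total_usd": 127}),
    Span("t-1", "payments.charge", "14:47",
         {"user.id": "12345", "error": "timeout"}),
    Span("t-2", "POST /api/v1/checkout", "14:52",
         {"user.id": "99999", "cart.items": 1, "cart.total_usd": 40}),
]

def find_trace(user_id: str, around: str) -> list[Span]:
    """Support says "checkout failed for user X at 2:47pm":
    return every span in that user's trace at that time."""
    trace_ids = {s.trace_id for s in spans
                 if s.attributes.get("user.id") == user_id
                 and s.timestamp == around}
    return [s for s in spans if s.trace_id in trace_ids]

trace = find_trace("12345", "14:47")
print([s.name for s in trace])  # the full journey, failure included
```

The point of the sketch: because every span carries `user.id`, one lookup returns the whole journey, failed payment call and all. No guessing which service to check first.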
That’s when war rooms become unnecessary. Because you don’t need eight people guessing. You need one person following the breadcrumbs.
The real cost
Every war room is a signal. It’s your system telling you it can’t be debugged.
You can keep paying the tax. $20k+ per month in senior engineer time. More in stress and attrition.
Or you can invest in observability that actually works. Traces that tell the story. Correlation that connects the dots.
The math isn’t complicated. The implementation is the hard part.
I wrote up the real costs of bad observability in more detail here. Including the stuff that doesn’t show up on spreadsheets.
— Youn