Last month I talked to a VP of Engineering at a Series C company. She was frustrated.
“My best engineer spent 15 hours last week debugging a database timeout. We hired him to build our recommendation engine.”
I hear this every week. Same story, different company.
The pattern
Here’s what happens at every fast-growing startup between Series B and D:
Your senior engineers become your only responders. Not by policy. By necessity.
They’re the only ones who understand how the system actually works. They’ve been there since the early days. They hold the architecture in their heads. When something breaks, they’re the only ones who know where to look.
So they get pulled into every incident. 30% of their time, easy. Sometimes more.
Meanwhile, juniors sit on the sidelines. Not because they’re lazy. Because they literally can’t help. They don’t have the context. They don’t know which service talks to which. They don’t know why that particular query sometimes hangs.
The math hurts
Your senior engineers cost $180k to $250k. Maybe more if you’re in the Bay.
30% of that going to firefighting is $54k to $75k per engineer per year. Just in salary. Not counting the features they’re not shipping.
You hired them to build. They’re debugging instead.
Why this happens
It’s a visibility problem.
Your system isn’t observable. Not really. You have dashboards, sure. Metrics, logs, maybe some traces. But none of it tells the story.
When an incident happens, someone needs to correlate everything manually. Check this dashboard. Compare it to that log. Grep through another service. Build the picture in their head.
That someone is always your senior engineer. Because they’re the only ones who know how to connect the dots.
Juniors can’t debug because the system doesn’t explain itself. So seniors stay stuck in the loop.
The actual fix
Make your system debuggable by someone who doesn’t have tribal knowledge.
This means:
Real distributed tracing. Not “we have Jaeger somewhere.” Actual end-to-end traces that show a request’s full journey. With business context attached. So when checkout fails, any engineer can pull up the trace and see exactly where it broke.
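Here’s the shape of it. This is a deliberately minimal sketch, not real instrumentation — in practice you’d use OpenTelemetry, but the mechanics are the same: one trace id shared by every hop, business attributes (like the order id) attached at the entry point. The `Span` class, `checkout`, and the attribute names are all hypothetical.

```python
import contextvars
import uuid

# Hypothetical minimal span -- a real system would use OpenTelemetry.
_current_span = contextvars.ContextVar("current_span", default=None)

class Span:
    def __init__(self, name, attributes=None):
        parent = _current_span.get()
        # Child spans inherit the trace id, so the whole journey is linked.
        self.trace_id = parent.trace_id if parent else uuid.uuid4().hex
        self.name = name
        self.attributes = dict(attributes or {})
        self.children = []

    def __enter__(self):
        parent = _current_span.get()
        if parent is not None:
            parent.children.append(self)
        self._token = _current_span.set(self)
        return self

    def __exit__(self, exc_type, exc, tb):
        _current_span.reset(self._token)
        return False

def checkout(order_id):
    # Business context attached once, at the entry point. When this trace
    # shows up in the UI, any engineer can see which order it was.
    with Span("checkout", {"order.id": order_id}) as root:
        with Span("payment.charge"):
            pass  # downstream calls share root.trace_id
        with Span("inventory.reserve"):
            pass
    return root
```

The point isn’t the plumbing. It’s that when checkout fails for order 42, the trace for order 42 already exists, and nobody has to reconstruct it from memory.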
Correlated data. Logs that link to traces. Traces that link to metrics. One click from “weird spike” to “here’s the actual request that failed.”
Runbooks that point to data. “Error rate high on checkout” should link directly to the dashboard that shows what’s happening. Not require 20 minutes of tribal knowledge to find.
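Concretely, that can live right in the alert definition. A sketch in Prometheus-style alerting config — the metric name, URLs, and threshold are all hypothetical placeholders for whatever your stack uses:

```yaml
# Hypothetical alert rule -- names and URLs are illustrative.
alert: CheckoutErrorRateHigh
expr: rate(checkout_errors_total[5m]) > 0.05
annotations:
  summary: "Checkout error rate above 5%"
  dashboard: https://grafana.internal/d/checkout-overview  # one click to the data
  runbook: https://wiki.internal/runbooks/checkout-errors  # steps, not tribal knowledge
```

The alert fires, and the page itself tells the responder where to look. No asking around in Slack.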
When your system explains itself, juniors can debug basic issues. They don’t need to page the senior at 2am. They can see what’s happening and follow the trail.
Your seniors go back to building.
One more thing
This isn’t just about tooling. It’s about culture. Most teams bolt on observability as an afterthought. “We’ll add traces later.” Later never comes.
The teams that get this right treat observability as a feature. It’s part of the definition of done. You don’t ship a service without proper instrumentation.
Because a service that can’t be debugged by your junior engineers isn’t production-ready. It’s a liability.
More on getting the instrumentation right in my guide to OTel pitfalls: common mistakes and how to avoid them.
— Youn