Every Series B-D company has one. Sometimes two.

The engineer who knows everything. The one who gets pulled into every incident. The person everyone tags in Slack when something’s weird.

They’re fast. They’re brilliant. They can look at a stack trace and just know where the problem is.

They’re also a massive liability.

I’ve seen this kill companies

Not literally. But close.

Last year I worked with a fintech that had been running fine for three years. Then their senior platform engineer left for a FAANG. Took a month to backfill.

In that month, they had two major incidents. Both took 4+ hours to resolve. Both would’ve been 20-minute fixes for the person who left.

Nobody else understood the system. Not really. They knew their piece. But the whole picture? That lived in one person’s head.

It’s not a people problem

The obvious reaction is “we need to hire better” or “we need more documentation.”

Neither works.

You can’t hire your way out of a visibility problem. New engineers will just become dependent on the hero too. And documentation? Nobody reads it. Especially not at 3 AM during an incident.

The real problem is that your system is opaque. Understanding requires tribal knowledge. You can’t just look at it and figure out what’s happening.

What observable actually means

Observable doesn’t mean “has dashboards.” It means anyone can answer questions about system behavior by looking at the data.

That’s a high bar. Most systems fail it completely.

Can a new engineer trace a request from the frontend to the database and back? Can they see where time is spent? Can they understand why something failed without asking someone?
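To make “where is the time spent” concrete, here’s a minimal sketch in plain Python. It is not tied to any tracing backend. The span records, field names, and timings are all made up for illustration, and it assumes sibling spans don’t overlap. The idea: a span’s self time is its duration minus the time spent in its children.

```python
from collections import defaultdict

# Hypothetical spans from a single request. Names and timings are invented.
# Times are milliseconds since the request started.
SPANS = [
    {"id": "a", "parent": None, "name": "GET /checkout", "start": 0, "end": 480},
    {"id": "b", "parent": "a", "name": "auth.verify_session", "start": 5, "end": 25},
    {"id": "c", "parent": "a", "name": "orders.create", "start": 30, "end": 470},
    {"id": "d", "parent": "c", "name": "db.insert_order", "start": 40, "end": 450},
]


def self_times(spans):
    """Return {span name: ms spent in the span itself, excluding its children}.

    Assumes children of the same parent do not overlap in time.
    """
    child_total = defaultdict(int)
    for span in spans:
        if span["parent"] is not None:
            child_total[span["parent"]] += span["end"] - span["start"]
    return {
        span["name"]: (span["end"] - span["start"]) - child_total[span["id"]]
        for span in spans
    }


def slowest(spans):
    """Name of the span with the largest self time."""
    times = self_times(spans)
    return max(times, key=times.get)
```

In this toy trace the root span is 480 ms end to end, but nearly all of it sits in the database insert. A good tracing UI shows you that at a glance. No hero required.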

If the answer is no, you have a hero problem. Even if your hero hasn’t left yet.

The fix takes effort

There’s no shortcut here. You have to make the invisible visible.

Proper distributed tracing. Not just logs with request IDs. Real traces that show the full journey. With spans that actually mean something.

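from opentelemetry import trace

tracer = trace.get_tracer(__name__)

# Bad: nobody knows what this does
with tracer.start_as_current_span("process"):
    do_stuff()

# Better: the name says what happens, the attributes say for whom.
# start_as_current_span makes this the active span, so child spans nest under it.
with tracer.start_as_current_span(
    "validate_payment_method",
    attributes={
        "payment.type": payment_type,
        "payment.provider": provider,
        "user.tier": user.tier,
    },
):
    validate_payment(payment)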

Runbooks that work. Not wikis that haven’t been updated since 2023. Runbooks that are tested, linked from alerts, and actually help during incidents.
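“Linked from alerts” is something you can enforce rather than hope for. Here’s a rough sketch of a check you could run in CI. The alert structure and the runbook_url field are assumptions, loosely modeled on Prometheus-style annotations, so adapt it to whatever your alerting config actually looks like.

```python
# Hypothetical alert definitions. The field names and URL are assumptions
# for illustration, not the format of any specific alerting tool.
ALERTS = [
    {
        "name": "PaymentLatencyHigh",
        "annotations": {
            "runbook_url": "https://runbooks.example.com/payment-latency"
        },
    },
    {"name": "QueueBacklog", "annotations": {}},
]


def alerts_missing_runbook(alerts):
    """Return the names of alerts with no non-empty runbook_url annotation."""
    missing = []
    for alert in alerts:
        url = alert.get("annotations", {}).get("runbook_url", "")
        if not url.strip():
            missing.append(alert["name"])
    return missing
```

Fail the build when the list isn’t empty. It won’t tell you whether the runbook is any good, but it does guarantee the person who gets paged has somewhere to start.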

Structured spans, not just metrics. When something breaks, you need to understand the specific request that failed. Aggregates tell you that something is wrong. They don’t tell you which request, or why. You need the details.
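Here’s a small illustration of the gap, in plain Python with invented data. The records stand in for finished spans, the attribute names echo the tracing example above, and the provider names and error code are placeholders. The median looks healthy. The one request that matters only shows up when you can filter on attributes.

```python
from statistics import median

# Made-up span records for illustration only.
REQUESTS = [
    {"trace_id": "t1", "duration_ms": 120, "status": "ok",
     "payment.provider": "provider_a", "user.tier": "free"},
    {"trace_id": "t2", "duration_ms": 130, "status": "ok",
     "payment.provider": "provider_a", "user.tier": "pro"},
    {"trace_id": "t3", "duration_ms": 110, "status": "ok",
     "payment.provider": "provider_b", "user.tier": "free"},
    {"trace_id": "t4", "duration_ms": 125, "status": "ok",
     "payment.provider": "provider_a", "user.tier": "free"},
    {"trace_id": "t5", "duration_ms": 9000, "status": "error",
     "payment.provider": "provider_b", "user.tier": "pro",
     "error.code": "card_token_expired"},
]


def median_latency(requests):
    """The aggregate view: one number for the whole window."""
    return median(r["duration_ms"] for r in requests)


def failed_requests(requests):
    """The detail view: the individual requests that failed, attributes intact."""
    return [r for r in requests if r["status"] == "error"]
```

A dashboard built on median_latency reports 125 ms and stays green. failed_requests hands you the trace ID, the provider, the user tier, and the error code for the request that actually broke.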

The test

Here’s how you know if you’ve fixed it.

Take your most junior engineer. Page them for something they’ve never seen before. Watch how they debug it.

If they can make progress without asking the hero, you’re good.

If they’re stuck until someone wakes up to help, you have work to do.

Most teams fail this test. But it’s the only test that matters.

Your hero will thank you

The irony is that heroes don’t want to be heroes. They’re tired. They’re burned out. They can’t take vacation without their phone blowing up.

Making your system observable isn’t just risk management. It’s how you keep your best engineers from quitting.

Give them a system that others can understand. Give them a week off without pages. That’s worth more than any title or bonus.


I wrote more about getting tracing right in my OTel pitfalls guide. The span naming section is especially relevant here.

— Youn