I had coffee with a VP of Engineering last week. Series C company, 80 engineers, growing fast. They’ve been looking for a senior SRE for four months.
Four months. And they’re not even close.
The unicorn problem
Here’s what their job posting asks for: Kubernetes expertise (EKS specifically), OpenTelemetry instrumentation, incident response experience, and—this is the kicker—someone who can actually read and modify application code.
That person exists. I’ve worked with maybe six of them in my career. They cost $200k+ in the Bay Area. More in NYC. And they’re already employed somewhere they like.
The realistic timeline? 3-6 months to find someone. Another 2 months to onboard them properly. Your systems don’t stop degrading while you search.
What happens in the meantime
I see this pattern constantly. The team is underwater. Alerts are noisy. On-call rotations are burning people out. Some critical service has silent failures that nobody’s figured out yet.
But hiring takes priority. So the fires keep smoldering.
One company I talked to had a memory leak in their payment service for three months. They knew about it. They just didn’t have anyone with the time or expertise to actually trace it. By the time they hired someone, it had caused two outages and one very awkward conversation with their biggest customer.
The gap nobody plans for
Here’s what I tell people: your SRE hire is a long-term investment. Good. Keep looking. Don’t settle.
But what’s your plan for the next 90 days?
Most teams don’t have one. They just hope nothing breaks too badly. Or they ask their backend engineers to “look into the observability stuff” on top of their actual jobs.
That’s how you get dashboards that lie. Metrics that don’t correlate. Alert rules copied from Stack Overflow that fire at 3am for non-issues.
An alternative
I’ve started doing short engagements—4-6 weeks—specifically for this gap. Come in, find the three things that are actually broken, fix them, document what I did.
It’s not a replacement for a full-time hire. It’s what you do while you’re searching.
Last month I spent four weeks with a fintech startup. Fixed their trace propagation (it was silently dropping context at their API gateway). Rebuilt their alerting to actually be useful. Set up the OTel instrumentation their next hire will inherit.
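To make "silently dropping context" concrete: with W3C Trace Context, the trace ID travels between services in a `traceparent` HTTP header. If a gateway forwards a request without that header, every downstream service starts a fresh trace and nothing errors out. Below is a minimal Python sketch of how you might check a single hop. It is an illustration, not the tooling from that engagement; the function names `parse_trace_id` and `context_survives_hop` are hypothetical, and it assumes you can capture the headers on both sides of the gateway.

```python
import re

# W3C Trace Context header layout: version-traceid-parentid-flags, lowercase hex.
TRACEPARENT_RE = re.compile(r"([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})")


def parse_trace_id(headers):
    """Return the trace ID from a traceparent header, or None if absent or malformed."""
    # HTTP header names are case-insensitive, so normalise before looking up.
    lowered = {name.lower(): value for name, value in headers.items()}
    match = TRACEPARENT_RE.fullmatch(lowered.get("traceparent", "").strip())
    if match is None:
        return None
    version, trace_id, parent_id, _flags = match.groups()
    # The spec treats version ff and all-zero IDs as invalid.
    if version == "ff" or set(trace_id) == {"0"} or set(parent_id) == {"0"}:
        return None
    return trace_id


def context_survives_hop(inbound_headers, outbound_headers):
    """True if the trace ID entering the hop is the same one leaving it."""
    inbound = parse_trace_id(inbound_headers)
    return inbound is not None and inbound == parse_trace_id(outbound_headers)


# Example headers (the IDs are sample values, not real traffic).
client = {"traceparent": "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01"}
# A healthy gateway creates a new span ID but keeps the trace ID.
healthy = {"Traceparent": "00-4bf92f3577b34da6a3ce929d0e0e4736-b7ad6b7169203331-01"}
# A broken gateway forwards the request without the header at all.
broken = {"content-type": "application/json"}

print(context_survives_hop(client, healthy))
print(context_survives_hop(client, broken))
```

The check compares trace IDs rather than whole headers on purpose: a gateway that participates in the trace is expected to change the span ID, so only the trace ID should stay constant across the hop.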
When their SRE finally starts, that person won’t spend the first two months untangling a mess. They’ll have something to build on.
The math
Four weeks of focused work vs. up to eight months of degraded reliability while you search and onboard.
I know which one I’d pick.
If you’re curious about common OTel issues I see during these engagements, I wrote about them in my guide to OTel pitfalls.
— Youn