I reviewed a job posting last week. Senior SRE role. The requirements section had 23 bullet points.
Kubernetes. Terraform. Prometheus. Grafana. AWS. GCP. Python. Go. Rust. OpenTelemetry. Datadog. PagerDuty. Incident command. Chaos engineering. SLO design. Capacity planning.
Nobody has all of this. The posting isn’t a requirements list. It’s a wishlist written by committee.
The real split
In my experience, SREs fall into two categories.
The first: YAML-movers. They can configure the hell out of any system. Helm charts, Terraform modules, Prometheus rules. They keep infrastructure running. Valuable, but limited.
The second: SREs who can touch application code. They can read a Go service, understand why it’s leaking goroutines, and actually fix it. They can add tracing to a Python codebase without breaking everything.
The second type is rare. Really rare. And they’re the ones who actually solve the hard problems.
When your metrics look fine but customers are complaining, you need someone who can read the code. Not just the dashboards.
The OTel question
Here’s a filter I’ve started using: Can they instrument a service with OpenTelemetry at the code level?
Not “Have you used OTel?” Anyone can install an auto-instrumentation library.
Can they add custom spans to a critical code path? Do they understand context propagation? Can they debug why traces are getting disconnected between services?
That’s where the real debugging happens. And most SRE candidates can’t do it.
Red flags in interviews
Things that make me skeptical:
- They can’t explain a production incident without blaming someone else
- All their debugging stories start with “we looked at the dashboard”
- They’ve never had to read application code to solve an infrastructure problem
- They describe their monitoring setup but can’t explain what it actually catches
The best SREs I’ve worked with have war stories that involve reading code they didn’t write, finding bugs that weren’t theirs, and fixing them anyway.
The question I always ask
“Walk me through how you’d debug a silent failure where metrics look fine but customers are complaining.”
The bad answer: “I’d check the logs and dashboards.”
The okay answer: “I’d look at the traces, check for error rates, maybe add some debugging metrics.”
The good answer involves specific questions. What kind of complaint? Which customers? What’s the request path? Then they talk about adding instrumentation, checking specific code paths, looking at the actual data flow.
They think like a developer who happens to care about reliability. Not like someone who only looks at graphs.
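One concrete shape this takes: a request that “succeeds” but returns nothing useful. Error-rate dashboards never see it; an explicit signal in the code does. A sketch, with hypothetical names:

```python
import logging

log = logging.getLogger("search")

def search(query: str, backend) -> list:
    # Hypothetical handler: a 200 with zero results looks healthy on every
    # error-rate graph, so emit an explicit signal for the silent case.
    results = backend.lookup(query)
    if not results:
        log.warning("search returned no results", extra={"search_query": query})
    return results
```

In practice you might emit a counter or a span event instead of a log line, but the point stands: the failure mode has to be named in the code before any dashboard can show it.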
What actually matters
If I had to cut that 23-bullet-point wishlist down to three:
- Can they read and modify application code in your main languages?
- Have they debugged problems that didn’t show up in monitoring?
- Can they instrument observability at the code level, not just configure tools?
Everything else can be learned. These three are hard to teach.
Want to understand what good OTel instrumentation looks like? I wrote about the real costs of getting it wrong.
— Youn