I’ve sat through a lot of SOC2 audits at this point. And I keep seeing the same thing: engineering teams who think observability is only about debugging get blindsided when auditors start asking questions.

Here’s the thing. SOC2 doesn’t care about your dashboards. It cares about one question: can you prove you’d know if something went wrong?

What auditors actually look for

The compliance frameworks are written in bureaucrat-speak, but it boils down to three things:

  1. Detection - Can you demonstrate you’d notice a data breach or service failure?
  2. Response - When you notice, do you have a documented process?
  3. Evidence - Can you pull logs showing what happened and when?

That last one trips up most teams. You might have great alerting. But if you can’t produce an audit trail six months later showing exactly who accessed what and when, that’s a finding.
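If you’re not sure what that audit trail looks like in practice, here’s a minimal sketch. The field names are mine, not a standard; the point is that every access event captures who, what, and when, as one queryable record:

```python
import json
from datetime import datetime, timezone

def audit_event(actor, action, resource):
    """One structured audit record: who did what to which data, and when."""
    return {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "actor": actor,        # who accessed it
        "action": action,      # read / write / export
        "resource": resource,  # which record
    }

# One JSON object per line, so it's still queryable six months later.
print(json.dumps(audit_event("jane@example.com", "read", "customers/4821")))
```

Emit these as newline-delimited JSON and retention becomes a storage problem instead of a forensics problem.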

The MTTR problem

If your mean time to resolution is over an hour, auditors will ask why. And “we’re working on it” isn’t an answer they accept.

I worked with a fintech last year that had solid monitoring. Looked good on paper. Then the auditor asked: “Show me how you’d detect if someone exfiltrated customer data.”

Silence.

They had APM. They had logging. They had metrics. But nothing was actually configured to detect that scenario. No alerting on unusual data access patterns. No anomaly detection on egress. Nothing.
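Egress anomaly detection doesn’t have to be sophisticated to close that gap. A crude sketch — the threshold and the daily-volume model are assumptions, and a real deployment would segment by service and destination:

```python
import statistics

def egress_anomaly(history_mb, current_mb, threshold=3.0):
    """True if today's egress is more than `threshold` standard deviations
    above the historical mean — a crude exfiltration signal."""
    mean = statistics.mean(history_mb)
    stdev = statistics.stdev(history_mb)
    return current_mb > mean + threshold * stdev

history = [95, 102, 98, 110, 99, 101, 97]  # daily egress in MB
print(egress_anomaly(history, 5000))  # a 5 GB spike trips the alert: True
print(egress_anomaly(history, 103))   # a normal day does not: False
```

Even something this simple, wired to a pager, would have given that fintech an answer instead of silence.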

That became a finding. And findings mean extra work, remediation timelines, and uncomfortable conversations with your board.

Observability is compliance infrastructure

This is the mental shift I try to get teams to make. Your tracing isn’t just for debugging slow endpoints. It’s evidence. Your alerting isn’t just for on-call. It’s proof of detection capability.

When you think about it that way, the requirements change:

  • Retention matters. 90 days minimum, often longer.
  • Completeness matters. Gaps in coverage become audit risks.
  • Consistency matters. “We log things differently in each service” is a red flag.
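The consistency point is the easiest to automate: agree on a minimal set of fields every service must log, then check records against it in CI or at ingest. A sketch with made-up field names:

```python
# The required-field set is illustrative — pick your own, but pick one.
REQUIRED_FIELDS = {"timestamp", "service", "actor", "action", "resource"}

def missing_audit_fields(record):
    """Return the required audit fields absent from a log record.
    An empty set means the record is audit-complete."""
    return REQUIRED_FIELDS - record.keys()

record = {"timestamp": "2024-03-01T12:00:00Z", "service": "billing",
          "actor": "svc-worker", "action": "read", "resource": "invoices/77"}
print(missing_audit_fields(record))   # complete record: set()
print(missing_audit_fields({"timestamp": "2024-03-01T12:00:00Z"}))
```

One shared schema turns “we log things differently in each service” from a red flag into a lint failure.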

What to do about it

Before your next audit, walk through these scenarios:

  • Could you prove when a specific user’s data was last accessed?
  • Would you detect an engineer accessing production data they shouldn’t?
  • Can you show a timeline of any incident from the past year?
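The first scenario above reduces to a log scan — if you kept structured audit events. A sketch, assuming the JSON-lines format and a `customers/<id>` resource convention (both are my own conventions, not a standard):

```python
import json

def last_access(log_lines, user_id):
    """Scan newline-delimited JSON audit events for the most recent
    access to a given user's data. ISO-8601 UTC timestamps compare
    correctly as strings, so no date parsing is needed."""
    latest = None
    for line in log_lines:
        event = json.loads(line)
        if event["resource"] == f"customers/{user_id}":
            if latest is None or event["timestamp"] > latest["timestamp"]:
                latest = event
    return latest

log = [
    '{"timestamp": "2024-01-05T09:00:00Z", "actor": "app", "resource": "customers/42"}',
    '{"timestamp": "2024-02-11T14:30:00Z", "actor": "jane", "resource": "customers/42"}',
    '{"timestamp": "2024-02-12T08:00:00Z", "actor": "app", "resource": "customers/7"}',
]
print(last_access(log, "42"))  # most recent event touching customers/42
```

If answering this takes a grep script you wrote five minutes ago, you’re in decent shape. If it takes a war room, you’re not.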

If you hesitated on any of those, you’ve got work to do.

The good news: this is solvable. The bad news: it usually means instrumenting things you’ve been putting off.


I cover the technical side of getting this right in my guide on reliability costs. It’s less about tools and more about what actually needs to happen.

— Youn