I did an architecture review last month for a Series B company. 47 engineers. $30M raised. Growing 40% quarter over quarter.

Their infrastructure diagram looked like spaghetti thrown at a wall.

They started with a Rails monolith. Reasonable choice. Then they hit scaling problems, so they extracted a few services. Then a few more. Then someone new joined and built a Go service because “we need performance.” Then another team shipped a Node service for real-time stuff.

Now they have 23 microservices. Nobody can draw from memory how they connect. Not even the CTO.

The “it works” phase is over

Here’s what happens at every fast-growing startup:

Phase 1: You ship fast. Things work. Revenue grows.

Phase 2: Things still mostly work. Occasional weird bugs. “Probably a race condition.” Ship it anyway.

Phase 3: Random failures. 3am pages. “The checkout is slow but only on Tuesdays.” Engineers spend 40% of their time debugging instead of building.

You’re in phase 3 now. I can tell because you’re reading this.

The real problem

You have dashboards. Lots of them. CPU usage, memory, request latency per service. Green checkmarks everywhere.

But when something breaks, you don’t know where. A user reports slow checkout. Is it the cart service? The payment service? The inventory check? The third-party fraud API? All of them?

You check each dashboard one by one. You grep through logs. You Slack the team who owns each service. Three hours later you find it was a slow database query in a service two hops away from where you started looking.

Individual service health doesn’t matter. Request flow matters. The path a user request takes through your system. The dependency chain. The thing that actually broke.

This is when observability stops being optional

At 5 services, you can hold the architecture in your head. At 25, you can’t. Nobody can.

You need traces. Real distributed traces that show you the entire journey of a request. Where it went, what it called, how long each step took, where it failed.
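If "trace" sounds abstract, here is roughly what one is made of. This is a toy sketch using only the Python standard library, not the OpenTelemetry API or any real tracing SDK. The names (Span, span, FINISHED, handle_checkout) and the service calls are made up for illustration, and the sleeps stand in for real work. The one idea to take from it: every span records its own timing plus the ID of the span that called it, and that parent link is what lets a tool rebuild the whole journey.

```python
import time
import uuid
from contextlib import contextmanager
from contextvars import ContextVar
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Span:
    trace_id: str              # shared by every span in one request
    span_id: str               # unique to this step
    parent_id: Optional[str]   # the step that called this one (None for the root)
    name: str
    start: float
    end: Optional[float] = None
    error: Optional[str] = None

    @property
    def duration_ms(self) -> float:
        end = self.start if self.end is None else self.end
        return (end - self.start) * 1000


# The span currently in progress for this execution context.
_current = ContextVar("current_span", default=None)
# Finished spans; a real system would export these to a tracing backend.
FINISHED: List[Span] = []


@contextmanager
def span(name: str):
    parent = _current.get()
    s = Span(
        trace_id=parent.trace_id if parent else uuid.uuid4().hex,
        span_id=uuid.uuid4().hex[:16],
        parent_id=parent.span_id if parent else None,
        name=name,
        start=time.perf_counter(),
    )
    token = _current.set(s)
    try:
        yield s
    except Exception as exc:
        s.error = repr(exc)    # record where the request failed
        raise
    finally:
        s.end = time.perf_counter()
        _current.reset(token)  # hand control back to the parent span
        FINISHED.append(s)


def handle_checkout():
    # Hypothetical request path; sleeps simulate downstream latency.
    with span("checkout"):
        with span("cart.load"):
            time.sleep(0.01)
        with span("payment.charge"):
            with span("fraud_api.score"):
                time.sleep(0.02)


handle_checkout()
for s in FINISHED:
    print(f"{s.name:<16} parent={s.parent_id} {s.duration_ms:.1f} ms")
```

Inside one process the parent link comes for free from the call stack. Across services, the trace and span IDs have to travel with the request (typically in a header), which is the part a real tracing library handles for you.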

Not logs. Logs are forensic evidence after the crime. You need to see the crime happening in real time.

Not metrics alone. Metrics tell you something is wrong. Traces tell you what and where.
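To make "what and where" concrete, here is a small sketch of the question a trace lets you ask. The span data is invented for illustration (hypothetical service names, timings in milliseconds), and the helper functions self_times and slowest_hop are my own names, not part of any tracing tool. It computes each span's self time, meaning its duration minus the time spent in its direct children, and assumes those children run one after another rather than in parallel.

```python
from collections import defaultdict

# Hypothetical spans from one slow checkout request (times in ms).
SPANS = [
    {"id": "a", "parent": None, "name": "checkout-api POST /checkout", "start": 0, "end": 2950},
    {"id": "b", "parent": "a", "name": "cart-service GetCart", "start": 5, "end": 45},
    {"id": "c", "parent": "a", "name": "payment-service Charge", "start": 50, "end": 2900},
    {"id": "d", "parent": "c", "name": "fraud-api Score", "start": 60, "end": 180},
    {"id": "e", "parent": "c", "name": "inventory-service Reserve", "start": 190, "end": 2880},
    {"id": "f", "parent": "e", "name": "postgres SELECT stock", "start": 200, "end": 2870},
]


def self_times(spans):
    """Duration of each span minus its direct children's durations.

    Assumes children of a span run sequentially, not concurrently.
    """
    child_total = defaultdict(int)
    for s in spans:
        if s["parent"] is not None:
            child_total[s["parent"]] += s["end"] - s["start"]
    return {s["id"]: (s["end"] - s["start"]) - child_total[s["id"]] for s in spans}


def slowest_hop(spans):
    """Return the span where the request actually spent its time."""
    times = self_times(spans)
    return max(spans, key=lambda s: times[s["id"]])


culprit = slowest_hop(SPANS)
print(culprit["name"], self_times(SPANS)[culprit["id"]], "ms")
# prints: postgres SELECT stock 2670 ms
```

A latency metric on checkout-api would show the 2950 ms and nothing more. The trace points at a database query inside a service two hops away, which is exactly the thing that took three hours to find by hand.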

I’ve seen teams go from 3-hour debugging sessions to 10-minute fixes after instrumenting properly. The investment pays for itself in the first month.

The uncomfortable truth

Your infrastructure isn’t broken because you made bad decisions. You made reasonable decisions under pressure. Ship fast, fix later. That’s startup life.

But “later” is now. The duct tape is showing. The random failures are costing you customers and burning out your team.

The fix isn’t a rewrite. It’s visibility. You can’t fix what you can’t see.

I wrote a detailed breakdown of common observability traps in “OTel pitfalls.” It’s worth reading before you start instrumenting.

— Youn