A few years back I was leading platform at a company that had grown to about 50 microservices. On paper, we did everything right. OpenTelemetry instrumented. Centralized logging. Distributed tracing. The whole stack.
In practice? Debugging anything was still a nightmare.
Here’s what actually happens at scale that the vendor demos don’t show you.
Context propagation dies at the message queue
This one hurt. We had beautiful traces through our synchronous HTTP calls. Then a request would hit Kafka and poof. Gone. New trace ID on the other side.
Turns out, propagating trace context through async messaging isn’t automatic. The producer has to inject the context into message headers, and the consumer has to extract it on the other side. And every service that touches the queue needs to do it the same way.
We had three teams writing Kafka consumers. Three different approaches to context propagation. Zero end-to-end traces through async flows.
Took us two months to standardize. Two months of debugging distributed transactions by grepping correlation IDs across services.
Everyone instruments differently
In theory, OpenTelemetry gives you a standard. In practice, developers make choices.
One team adds user_id to every span. Another team calls it userId. A third team puts it in the resource attributes instead of span attributes.
Now your dashboards are useless unless you know which service follows which convention.
We eventually wrote linting rules for our instrumentation. Sounds overkill until you’ve spent an afternoon wondering why your queries return nothing for half your services.
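Our lint rules were nothing fancy. The core of it was a check like this, run in CI against the attribute names each service emits; the allowed set and naming rule here are hypothetical stand-ins for whatever vocabulary your teams agree on:

```python
import re

# Hypothetical team convention: attributes are snake_case and drawn
# from a shared vocabulary, so "user_id" passes and "userId" does not.
ALLOWED_ATTRIBUTES = {"user_id", "tenant_id", "order_id"}
SNAKE_CASE = re.compile(r"^[a-z][a-z0-9_.]*$")

def lint_attribute(name: str) -> list:
    problems = []
    if not SNAKE_CASE.match(name):
        problems.append(f"{name!r}: not snake_case")
    if name not in ALLOWED_ATTRIBUTES:
        problems.append(f"{name!r}: not in the shared vocabulary")
    return problems

print(lint_attribute("user_id"))  # no violations: []
print(lint_attribute("userId"))   # flags both the casing and the vocabulary
```

Twenty lines of code, and suddenly a dashboard query for `user_id` works against every service.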
"Observability" becomes 8 dashboards nobody looks at
I’ve seen this pattern everywhere. Team gets excited about observability. Builds dashboards. Then builds more dashboards. Then builds dashboards for the dashboards.
Six months later, there’s a war room incident and everyone’s staring at Grafana trying to remember which dashboard has the thing they need.
The fix is brutal but necessary: delete most of them. Keep three or four that actually matter. If nobody has looked at a dashboard in 30 days, it’s noise.
The collector becomes a bottleneck nobody monitors
This is my favorite irony. You build an observability pipeline to monitor your system. But who monitors the observability pipeline?
At scale, your OTel collector is processing serious volume. It can drop data. It can fall behind. It can OOM. And when it does, you won’t know because the thing that would tell you is the thing that’s broken.
We had a production incident once where the collector was dropping 40% of spans due to memory pressure. Took us a week to notice. A week of incomplete traces and misleading metrics.
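Two guardrails would have caught that week early: a memory_limiter processor first in the pipeline, and the collector’s own metrics scraped like any other production service. A sketch of the relevant config fragments, with illustrative numbers (the field names follow the collector’s documented shape, but check your collector version; newer releases configure internal telemetry differently):

```yaml
processors:
  # Put memory_limiter first so backpressure kicks in before an OOM.
  memory_limiter:
    check_interval: 1s
    limit_mib: 1500
    spike_limit_mib: 300

service:
  telemetry:
    metrics:
      address: 0.0.0.0:8888  # expose the collector's own metrics
  pipelines:
    traces:
      receivers: [otlp]
      processors: [memory_limiter, batch]
      exporters: [otlp]
```

Then alert on the collector’s self-reported refusal and export-failure counters. Dropping 40% of spans silently is a choice you make by not scraping port 8888.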
What actually works
After all that, here’s what I’d tell past me:
- Standardize before you scale. Instrumentation conventions, context propagation patterns, attribute naming. Do it early.
- Monitor your monitoring. Seriously. Collector health, pipeline latency, drop rates. Treat it like production infrastructure because it is.
- Less is more on dashboards. Start with alerts on business outcomes. Add dashboards when you find yourself needing them in incidents.
- Async is different. Budget real time for getting context propagation right through queues and event systems.
None of this is revolutionary. But I still see teams making the same mistakes we did. Usually because nobody warned them.
If you’re hitting these walls, I wrote up some specifics in my OTel pitfalls guide. Might save you a few months.
— Youn