We had a 3-hour outage last Tuesday. The root cause? An OpenTelemetry SDK upgrade from 1.28 to 2.1.
Nobody saw it coming. The config was the same. The code was the same. But production traces just… stopped.
Too many knobs
OTel in 2026 has configuration options for everything. Batch sizes, export intervals, resource attributes, propagators, samplers, exporters. It’s flexible. It’s also a minefield.
Here’s what worked in 1.x:
exporters:
  otlp:
    endpoint: "collector.internal:4317"
    headers:
      authorization: "Bearer ${OTEL_TOKEN}"
    compression: gzip
Here’s what 2.x wants:
exporters:
  otlp/traces:
    endpoint: "https://collector.internal:4317"
    headers:
      authorization: "Bearer ${OTEL_TOKEN}"
    compression: gzip
    protocol: grpc
Spot the differences? The endpoint needs a scheme now. The exporter key changed. There’s a new protocol field that’s suddenly required.
These aren’t breaking changes according to the changelog. They’re “improvements to configuration clarity.”
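A check for the three differences above is easy to script before an upgrade. What follows is a minimal sketch, not an official migration tool: the function name `find_2x_issues` is hypothetical, the rules come only from the two snippets in this post, and the config is assumed to be already parsed into a plain dict (in practice you would load it with a YAML parser first).

```python
def find_2x_issues(config: dict) -> list:
    """Return warnings for the three 1.x -> 2.x differences described in this post.

    Assumption: the rules below mirror the two example configs above,
    not any official OpenTelemetry validation logic.
    """
    issues = []
    for name, settings in config.get("exporters", {}).items():
        settings = settings or {}

        # 1. Exporter key: the 2.x example uses a signal suffix (otlp/traces).
        if name == "otlp":
            issues.append(f"{name}: exporter key has no signal suffix, e.g. otlp/traces")

        # 2. Endpoint: the 2.x example carries an explicit scheme.
        endpoint = settings.get("endpoint", "")
        if "://" not in endpoint:
            issues.append(f"{name}: endpoint {endpoint!r} has no scheme")

        # 3. Protocol: the 2.x example sets the protocol field explicitly.
        if "protocol" not in settings:
            issues.append(f"{name}: missing protocol field")
    return issues


# The 1.x config from this post, as a dict (headers omitted for brevity).
OLD_CONFIG = {
    "exporters": {
        "otlp": {"endpoint": "collector.internal:4317", "compression": "gzip"}
    }
}

for warning in find_2x_issues(OLD_CONFIG):
    print(warning)
```

Running this against the old config should flag all three problems. A check like this in CI catches a silent config drift before it reaches production, though it only knows about the differences you have already written down.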
The enterprise problem
Big companies move slowly. Intentionally. Their OTel collector version is probably 6 months old. Their SDK might be a year behind.
Meanwhile, the OTel project ships releases every few weeks. New features, new conventions, new defaults.
So you end up with SDK version 2.1 trying to talk to a collector running 0.88. Or your Jaeger backend doesn’t understand the new span format. Or your metric names don’t match because semantic conventions changed again.
I spent two days last month debugging why spans weren’t connecting. Turns out the trace ID format changed subtly between library versions. The collector was dropping spans because it couldn’t correlate them.
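When spans refuse to connect, one cheap first step is to check whether the trace IDs you are emitting are well formed. The sketch below assumes the W3C Trace Context format for a trace ID: 32 lowercase hex characters, not all zeros. It does not reproduce the specific change I hit, and `trace_id_problem` is a hypothetical helper name.

```python
import string
from typing import Optional

# Characters allowed in a W3C Trace Context trace-id.
HEX_LOWER = set(string.digits + "abcdef")


def trace_id_problem(trace_id: str) -> Optional[str]:
    """Return a description of what is wrong with a trace ID, or None if it looks fine."""
    if len(trace_id) != 32:
        return f"length {len(trace_id)}, expected 32"
    if any(ch not in HEX_LOWER for ch in trace_id):
        return "contains characters other than lowercase hex"
    if trace_id == "0" * 32:
        return "all zeros"
    return None


# Hypothetical IDs pulled from logs on either side of an upgrade.
samples = [
    "4bf92f3577b34da6a3ce929d0e0e4736",  # well formed
    "4BF92F3577B34DA6A3CE929D0E0E4736",  # uppercase
    "a3ce929d0e0e4736",                  # 64-bit ID, too short
]
for sample in samples:
    print(sample, "->", trace_id_problem(sample))
```

If every ID passes, the mismatch is somewhere else (propagation headers, sampling, the backend), but at least you have ruled out the format in minutes instead of days.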
Nobody tests the upgrade path
OTel has great documentation for getting started. Fresh installs work fine. But upgrades? You’re on your own.
There’s no “here’s what changed between 1.x and 2.x that might break your stuff” guide. You piece it together from GitHub issues and Discord threads.
I’ve started keeping a personal changelog. Every config change, every breaking behavior, every workaround. It’s ugly but it’s saved me twice already.
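The config part of a changelog like that can be generated rather than typed. Here is a rough sketch, assuming both configs are already parsed into nested dicts; `flatten` and `diff_configs` are hypothetical helper names, and the output format is just one possibility.

```python
def flatten(config: dict, prefix: str = "") -> dict:
    """Turn a nested dict into {"a.b.c": value} pairs. Empty dicts are dropped."""
    flat = {}
    for key, value in config.items():
        path = f"{prefix}.{key}" if prefix else str(key)
        if isinstance(value, dict):
            flat.update(flatten(value, path))
        else:
            flat[path] = value
    return flat


def diff_configs(old: dict, new: dict) -> list:
    """Return one changelog line per added, removed, or changed setting."""
    old_flat, new_flat = flatten(old), flatten(new)
    entries = []
    for path in sorted(old_flat.keys() | new_flat.keys()):
        if path not in new_flat:
            entries.append(f"removed {path} (was {old_flat[path]!r})")
        elif path not in old_flat:
            entries.append(f"added {path} = {new_flat[path]!r}")
        elif old_flat[path] != new_flat[path]:
            entries.append(f"changed {path}: {old_flat[path]!r} -> {new_flat[path]!r}")
    return entries


before = {"exporters": {"otlp": {"endpoint": "collector.internal:4317"}}}
after = {"exporters": {"otlp/traces": {"endpoint": "https://collector.internal:4317",
                                       "protocol": "grpc"}}}
for line in diff_configs(before, after):
    print(line)
```

Note that a renamed key shows up as one removal plus some additions. The diff cannot tell you it was a rename, so that part of the changelog still needs a human note.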
What I do now
Before any OTel upgrade:
- Run the new version in shadow mode first. Send traces to both old and new collectors.
- Check the semantic conventions changelog. Yes, it exists. No, nobody reads it.
- Test with your actual collector version, not the latest.
- Keep the old config around for at least a week.
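Shadow mode only helps if you actually compare what each pipeline received. A minimal sketch of that comparison, assuming you can export the (trace ID, span ID) pairs seen by the old and new collectors over the same time window; how you get those pairs depends on your backend, and `compare_shadow_runs` is a hypothetical name.

```python
def compare_shadow_runs(old_spans, new_spans) -> dict:
    """Compare span identities seen by the old and new pipelines.

    Both arguments are iterables of (trace_id, span_id) tuples.
    """
    old_set, new_set = set(old_spans), set(new_spans)
    return {
        "missing_in_new": sorted(old_set - new_set),  # spans the new path dropped
        "only_in_new": sorted(new_set - old_set),     # spans the old path never saw
        "matched": len(old_set & new_set),
    }


# Hypothetical exports from a short shadow run.
old_run = [("t1", "s1"), ("t1", "s2"), ("t2", "s1")]
new_run = [("t1", "s1"), ("t2", "s1")]

report = compare_shadow_runs(old_run, new_run)
print(report["matched"], "matched;", len(report["missing_in_new"]), "missing in new")
```

Anything in `missing_in_new` is a span the upgrade would have lost in production. A non-empty list there is the signal to stop and dig before rolling out.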
The upgrade from 1.x to 2.x was supposed to be straightforward. Three hours of downtime says otherwise.
I wrote more about common OTel traps in my technical guide on OTel pitfalls.
— Youn