Last month I watched a senior engineer spend 47 minutes tracking down a single failed request. Not because it was a hard bug. Because the request touched three clouds.
Start in CloudWatch. See the Lambda fired. Check Azure Monitor for the API call. Nothing there. Wait, wrong subscription. Found it. Now trace the downstream call to GCP. Open Stackdriver. Different project. Different trace ID format. Manually correlate timestamps.
47 minutes. For a timeout.
It's 2026 and we're still here
Everyone went multi-cloud. Made sense at the time. Best tool for each job. Avoid vendor lock-in. All that.
Nobody thought about what happens when something breaks across all three.
Now you’ve got:
- AWS CloudWatch with its own trace format
- Azure Monitor with its own correlation IDs
- GCP Cloud Trace doing its own thing
- Plus whatever self-hosted stuff you’re running
Each one is fine on its own. Together? A nightmare.
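To make the mismatch concrete: an AWS X-Ray trace ID and a W3C trace ID carry the same 32 hex characters, just packaged differently. Here's a minimal stdlib-only sketch of the conversion (the `xray_to_w3c` helper is mine, not a real library function; OTel's X-Ray propagator does this mapping for you):

```python
import re

# X-Ray:  1-5759e988-bd862e3fe1be46a994272793   (version-epoch-random)
# W3C:    5759e988bd862e3fe1be46a994272793      (32 lowercase hex chars)
def xray_to_w3c(xray_trace_id: str) -> str:
    match = re.fullmatch(r"1-([0-9a-f]{8})-([0-9a-f]{24})", xray_trace_id)
    if not match:
        raise ValueError(f"not an X-Ray trace ID: {xray_trace_id}")
    # Drop the version prefix, join epoch + random into one 32-char ID
    return match.group(1) + match.group(2)

print(xray_to_w3c("1-5759e988-bd862e3fe1be46a994272793"))
# -> 5759e988bd862e3fe1be46a994272793
```

Same bytes, different envelope. That's the whole problem in miniature: every vendor agrees on the payload and disagrees on the wrapper.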
Mean time to resolution is climbing. Nobody admits why.
I’ve been tracking this across my clients. Three years ago, most could find root cause in 20 minutes. Now it’s pushing 45. Sometimes an hour.
The infrastructure got more complex. The tooling didn’t keep up.
OpenTelemetry was supposed to fix this. Unified format. Vendor neutral. Ship traces anywhere.
In theory? Great.
In practice? You still end up with:
```yaml
# Three different exporters, three different backends
exporters:
  awsxray:
    region: us-east-1
  azuremonitor:
    connection_string: ${AZURE_CONN}
  googlecloud:
    project: my-project
```
Now your traces are unified… and split across three UIs again.
What actually works
I’ve tried a lot of approaches. Most don’t work. Here’s what does.
Single pane of glass. For real this time.
Pick one backend. Route everything there. Yes, even if it costs more than the native tools. The time savings pay for it in a month.
```yaml
exporters:
  otlp:
    endpoint: "your-unified-backend:4317"

service:
  pipelines:
    traces:
      exporters: [otlp]  # One destination. That's it.
```
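For context, a full collector pipeline around a single OTLP exporter might look something like this. A sketch, not a drop-in config: the receiver and processor choices here are my assumptions, and `your-unified-backend` is a placeholder for whatever backend you pick.

```yaml
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317

processors:
  batch: {}

exporters:
  otlp:
    endpoint: "your-unified-backend:4317"

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [batch]
      exporters: [otlp]
```

Every cloud's workloads ship to the same collector, the collector ships to one place. Boring on purpose.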
Propagate the same trace ID everywhere.
This sounds obvious but I see teams mess it up constantly. Your trace ID needs to survive the AWS-to-Azure hop. The Azure-to-GCP hop. Every boundary.
```python
# When calling across cloud boundaries
headers = {
    "traceparent": f"00-{trace_id}-{span_id}-01",
    # Not x-amzn-trace-id. Not x-correlation-id.
    # The W3C standard. Every time.
}
```
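If you want to see the contract end to end, here's a stdlib-only sketch of building and parsing that header. The function names are mine; in a real service you'd let your OTel SDK's W3C propagator do this instead of hand-rolling it:

```python
import re
import secrets

def make_traceparent(trace_id: str = "") -> str:
    # Reuse the incoming trace ID if we have one; mint one otherwise.
    trace_id = trace_id or secrets.token_hex(16)  # 32 hex chars
    span_id = secrets.token_hex(8)                # 16 hex chars
    return f"00-{trace_id}-{span_id}-01"

def parse_traceparent(header: str) -> dict:
    m = re.fullmatch(r"00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})", header)
    if not m:
        raise ValueError(f"malformed traceparent: {header}")
    return {"trace_id": m.group(1), "span_id": m.group(2), "flags": m.group(3)}

# The trace ID survives the hop; only the span ID changes per service.
incoming = make_traceparent()
downstream = make_traceparent(parse_traceparent(incoming)["trace_id"])
assert parse_traceparent(downstream)["trace_id"] == parse_traceparent(incoming)["trace_id"]
```

That final assert is the whole point: the same trace ID walks through AWS, Azure, and GCP, so one query in one backend shows the whole request.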
Stop using native consoles for debugging.
CloudWatch is fine for AWS-only stuff. The moment you go multi-cloud, you need a tool built for that. Grafana, Honeycomb, Lightstep, whatever. Just pick one and commit.
The uncomfortable truth
Multi-cloud added complexity nobody budgeted for. The monitoring vendors are catching up. But slowly.
Until then, you’re either building this yourself or paying someone to help. There’s no magic tool that makes three clouds behave like one.
I wish there was. Would make my job easier.
I wrote more about the OTel implementation gotchas in my technical guide on OTel pitfalls. Especially the cross-cloud propagation stuff.
— Youn