# 3 OTel Pitfalls That Kill EKS Observability
You deployed OpenTelemetry. Traces are flowing. Dashboards look good. Then three months later your Datadog bill is 10x what you budgeted, half your traces are broken, and you still can't figure out why checkout failed for that one customer in Germany.
I've seen this happen at multiple Series B-D companies. The pattern is always the same. Here's what goes wrong and how to actually fix it.
## 1. The High-Cardinality Cost Trap

### What happens
Your engineers are diligent. They tag everything: `user_id`, `order_id`, `session_id`, `request_id`. Makes sense, right? You want to trace individual requests.
Except now you have millions of unique tag combinations. Datadog (or whatever vendor) charges per unique metric series. That $2k/month bill becomes $20k. I've seen $50k+ surprises.
The worst part? Nobody realizes until finance flags it 45 days later.
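To see why the bill scales this way, count series instead of requests: every unique tag combination is its own billable series, so distinct-value counts multiply. A rough sketch with made-up numbers (these counts are illustrative, not from any real system):

```python
# Why cardinality explodes: each unique tag combination is one billable
# metric series, so distinct-value counts multiply.
users = 50_000    # distinct user_id values (hypothetical)
endpoints = 40    # distinct http.route values (hypothetical)
statuses = 5      # distinct status buckets (hypothetical)

series = users * endpoints * statuses
print(series)  # ten million billable series from a single metric
```

One diligent `user_id` tag turned one metric into ten million series. That is the multiplication your vendor bills you for.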
### The fix
Use the OTel Collector's Attributes Processor to scrub or hash high-cardinality tags at the edge, before they hit your vendor.
```yaml
processors:
  attributes:
    actions:
      - key: user_id
        action: hash      # keeps correlation, kills cardinality
      - key: order_id
        action: delete    # or just drop it entirely
      - key: session_id
        action: hash
```

Run this in your collector pipeline. Your traces still correlate (hashed IDs match), but you're not paying per-user pricing to your observability vendor.
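Hashing keeps correlation because it's deterministic: the same raw ID always maps to the same token, so spans for one user still join up, without the raw identifier ever reaching your vendor. A minimal sketch of the idea (the collector uses its own hash internally; SHA-256 here is purely for illustration):

```python
import hashlib

def hash_attr(value: str) -> str:
    # Deterministic: same input -> same token, so spans still correlate,
    # but the raw user_id never leaves your infrastructure.
    return hashlib.sha256(value.encode()).hexdigest()[:16]

# Two spans tagged with the same user carry the same hashed attribute...
print(hash_attr("user-42") == hash_attr("user-42"))  # True
# ...while distinct users remain distinguishable.
print(hash_attr("user-42") == hash_attr("user-43"))  # False
```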
## 2. Context Propagation Black Holes

### What happens
You instrument your Go API. You instrument your Python workers. You look at a trace and... it's three separate fragments. The trace ID dies somewhere between services.
Usually it's one of these:

- The load balancer strips the headers
- A message queue (SQS, Kafka) doesn't propagate context
- Someone's using a raw `http.Client` instead of the instrumented one
- The Python service uses a different propagation format than Go
You end up with "traces" that show 20% of what actually happened. Useless for debugging.
### The fix
Standardize on W3C Trace Context headers everywhere. Then audit every network boundary.
```
# Check your ALB/nginx - make sure these headers pass through:
#   traceparent: 00-{trace_id}-{span_id}-{flags}
#   tracestate:  (optional, vendor-specific)
```

For message queues, inject the context into message attributes:

```python
from opentelemetry.propagate import inject

def publish_message(queue, payload):
    carrier = {}
    inject(carrier)  # puts traceparent from the current context into carrier
    message_attributes = {
        'traceparent': carrier.get('traceparent')
    }
    queue.send(payload, attributes=message_attributes)
```

The unglamorous truth: you have to manually verify every single service boundary. There's no magic "auto-instrument everything" that actually works across a real distributed system.
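One cheap way to audit a boundary is to assert the header's shape on the receiving side. A hypothetical validator for the W3C `traceparent` format (the real spec has more rules, e.g. all-zero trace IDs are invalid; this only checks the basic shape, which is enough to catch a stripped or mangled header):

```python
import re

# W3C Trace Context traceparent: version-traceid-spanid-flags, lowercase hex.
TRACEPARENT_RE = re.compile(
    r"^[0-9a-f]{2}-[0-9a-f]{32}-[0-9a-f]{16}-[0-9a-f]{2}$"
)

def is_valid_traceparent(header: str) -> bool:
    return TRACEPARENT_RE.match(header) is not None

print(is_valid_traceparent("00-" + "a" * 32 + "-" + "b" * 16 + "-01"))  # True
print(is_valid_traceparent("no-trace-here"))                            # False
```

Log a warning when an inbound request fails this check and you'll find your black holes in a day instead of a quarter.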
3. The "200 OK" Illusion
What happens
Your monitoring says everything is fine. All services returning 200. Latency is normal. Error rate is 0.01%.
Meanwhile, customers are complaining that checkout is broken. Support tickets are piling up. Your team is in a war room trying to figure out what's wrong.
Turns out the payment service returns 200 OK with `{"status": "failed", "reason": "insufficient_funds"}`.
Or the inventory service returns 200 but an empty array because the database query timed out silently.
Or the email service returns 200 but the email never actually sent.
HTTP status codes don't validate business logic. Your monitoring is lying to you.
### The fix
Instrument business logic spans, not just HTTP spans. Use OTel semantic conventions to track actual outcomes.
```go
import (
    "context"

    "go.opentelemetry.io/otel/attribute"
    "go.opentelemetry.io/otel/codes"
)

// Don't just trace the HTTP call.
// Trace the business outcome.
func ProcessCheckout(ctx context.Context, order Order) error {
    ctx, span := tracer.Start(ctx, "checkout.process")
    defer span.End()

    // ... do the work (using ctx) ...

    // Record the BUSINESS outcome, not just "it didn't crash"
    span.SetAttributes(
        attribute.Bool("checkout.successful", result.Success),
        attribute.String("checkout.failure_reason", result.FailureReason),
        attribute.Float64("checkout.amount", order.Total),
        attribute.Int("checkout.item_count", len(order.Items)),
    )
    if !result.Success {
        span.SetStatus(codes.Error, result.FailureReason)
    }
    return nil
}
```
Now you can alert on `checkout.successful == false` regardless of HTTP status.
Your dashboard shows actual business health, not infrastructure health.
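If your backend exposes span attributes to queries, the alert is just a filter on the business attribute. A toy version in Python over exported span attributes (the attribute names follow the Go example above; hypothetical data, and a real alert would live in your vendor's query language):

```python
# Exported span attributes, every one of them a 200 OK at the HTTP layer.
spans = [
    {"checkout.successful": True,  "checkout.failure_reason": ""},
    {"checkout.successful": False, "checkout.failure_reason": "insufficient_funds"},
    {"checkout.successful": False, "checkout.failure_reason": "inventory_timeout"},
]

# Alert on the business outcome, not the status code.
failures = [s for s in spans if not s["checkout.successful"]]
failure_rate = len(failures) / len(spans)
print(round(failure_rate, 2))  # 0.67 -- fires while HTTP error rate sits at 0%
```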
## The real problem
These aren't "gotchas" you learn once and forget. They're ongoing maintenance.
Every new service needs proper context propagation. Every new engineer needs to know not to tag with `user_id`. Every new feature needs business logic spans.
Most teams don't have someone who's done this across 50+ microservices. They're figuring it out as they go, making expensive mistakes along the way.
If you're scaling on EKS and your observability is more pain than help, I do a 3-day audit that maps exactly where these gaps are in your system. No fluff, just a ranked list of what to fix and how.
Book a 30-min call if you want to talk through it.
— Youn