# 3 OTel Pitfalls That Kill EKS Observability
You deployed OpenTelemetry. Traces are flowing. Dashboards look good. Then three months later your Datadog bill is 10x what you budgeted, half your traces are broken, and you still can't figure out why checkout failed for that one customer in Germany.
I've seen this happen at multiple Series B-D companies. The pattern is always the same. Here's what goes wrong and how to actually fix it.
## 1. The High-Cardinality Cost Trap

### What happens
Your engineers are diligent. They tag everything: `user_id`, `order_id`, `session_id`, `request_id`. Makes sense, right? You want to trace individual requests.
Except now you have millions of unique tag combinations. Datadog (or whatever vendor) charges per unique metric series. That $2k/month bill becomes $20k. I've seen $50k+ surprises.
The worst part? Nobody realizes until finance flags it 45 days later.
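To see why the bill scales this way, count series instead of requests: every unique tag combination is its own billable series, so distinct-value counts multiply. A rough sketch with made-up numbers (these counts are illustrative, not from any real system):

```python
# Why cardinality explodes: each unique tag combination is one billable
# metric series, so distinct-value counts multiply.
users = 50_000    # distinct user_id values (hypothetical)
endpoints = 40    # distinct http.route values (hypothetical)
statuses = 5      # distinct status buckets (hypothetical)

series = users * endpoints * statuses
print(series)  # ten million billable series from a single metric
```

One diligent `user_id` tag turned one metric into ten million series. That is the multiplication your vendor bills you for.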
### The fix
Use the OTel Collector's Attributes Processor to scrub or hash high-cardinality tags at the edge, before they hit your vendor.
```yaml
processors:
  attributes:
    actions:
      - key: user_id
        action: hash      # keeps correlation, kills cardinality
      - key: order_id
        action: delete    # or just drop it entirely
      - key: session_id
        action: hash
```

Run this in your collector pipeline. Your traces still correlate (hashed IDs match), but you're not paying per-user pricing to your observability vendor.
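Hashing keeps correlation because it's deterministic: the same raw ID always maps to the same token, so spans for one user still join up, without the raw identifier ever reaching your vendor. A minimal sketch of the idea (the collector uses its own hash internally; SHA-256 here is purely for illustration):

```python
import hashlib

def hash_attr(value: str) -> str:
    # Deterministic: same input -> same token, so spans still correlate,
    # but the raw user_id never leaves your infrastructure.
    return hashlib.sha256(value.encode()).hexdigest()[:16]

# Two spans tagged with the same user carry the same hashed attribute...
print(hash_attr("user-42") == hash_attr("user-42"))  # True
# ...while distinct users remain distinguishable.
print(hash_attr("user-42") == hash_attr("user-43"))  # False
```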
## 2. Context Propagation Black Holes

### What happens
You instrument your Go API. You instrument your Python workers. You look at a trace and... it's three separate fragments. The trace ID dies somewhere between services.
Usually it's one of these:

- The load balancer strips the headers
- A message queue (SQS, Kafka) doesn't propagate context
- Someone's using a raw `http.Client` instead of the instrumented one
- The Python service uses a different propagation format than Go
You end up with "traces" that show 20% of what actually happened. Useless for debugging.
### The fix
Standardize on W3C Trace Context headers everywhere. Then audit every network boundary.
```
# Check your ALB/nginx - make sure these headers pass through:
#   traceparent: 00-{trace_id}-{span_id}-{flags}
#   tracestate:  (optional, vendor-specific)
```

For message queues, inject the context into message attributes:

```python
from opentelemetry.propagate import inject

def publish_message(queue, payload):
    carrier = {}
    inject(carrier)  # puts traceparent from the current context into carrier
    message_attributes = {
        'traceparent': carrier.get('traceparent')
    }
    queue.send(payload, attributes=message_attributes)
```

The unglamorous truth: you have to manually verify every single service boundary. There's no magic "auto-instrument everything" that actually works across a real distributed system.
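One cheap way to audit a boundary is to assert the header's shape on the receiving side. A hypothetical validator for the W3C `traceparent` format (the real spec has more rules, e.g. all-zero trace IDs are invalid; this only checks the basic shape, which is enough to catch a stripped or mangled header):

```python
import re

# W3C Trace Context traceparent: version-traceid-spanid-flags, lowercase hex.
TRACEPARENT_RE = re.compile(
    r"^[0-9a-f]{2}-[0-9a-f]{32}-[0-9a-f]{16}-[0-9a-f]{2}$"
)

def is_valid_traceparent(header: str) -> bool:
    return TRACEPARENT_RE.match(header) is not None

print(is_valid_traceparent("00-" + "a" * 32 + "-" + "b" * 16 + "-01"))  # True
print(is_valid_traceparent("no-trace-here"))                            # False
```

Log a warning when an inbound request fails this check and you'll find your black holes in a day instead of a quarter.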
3. The "200 OK" Illusion
What happens
Your monitoring says everything is fine. All services returning 200. Latency is normal. Error rate is 0.01%.
Meanwhile, customers are complaining that checkout is broken. Support tickets are piling up. Your team is in a war room trying to figure out what's wrong.
Turns out the payment service returns 200 OK with `{"status": "failed", "reason": "insufficient_funds"}`.
Or the inventory service returns 200 but an empty array because the database query timed out silently.
Or the email service returns 200 but the email never actually sent.
HTTP status codes don't validate business logic. Your monitoring is lying to you.
### The fix
Instrument business logic spans, not just HTTP spans. Use OTel semantic conventions to track actual outcomes.
```go
import (
    "context"

    "go.opentelemetry.io/otel/attribute"
    "go.opentelemetry.io/otel/codes"
)

// Don't just trace the HTTP call.
// Trace the business outcome.
func ProcessCheckout(ctx context.Context, order Order) error {
    ctx, span := tracer.Start(ctx, "checkout.process")
    defer span.End()

    // ... do the work (using ctx) ...

    // Record the BUSINESS outcome, not just "it didn't crash"
    span.SetAttributes(
        attribute.Bool("checkout.successful", result.Success),
        attribute.String("checkout.failure_reason", result.FailureReason),
        attribute.Float64("checkout.amount", order.Total),
        attribute.Int("checkout.item_count", len(order.Items)),
    )
    if !result.Success {
        span.SetStatus(codes.Error, result.FailureReason)
    }
    return nil
}
```
Now you can alert on `checkout.successful == false` regardless of HTTP status.
Your dashboard shows actual business health, not infrastructure health.
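If your backend exposes span attributes to queries, the alert is just a filter on the business attribute. A toy version in Python over exported span attributes (the attribute names follow the Go example above; hypothetical data, and a real alert would live in your vendor's query language):

```python
# Exported span attributes, every one of them a 200 OK at the HTTP layer.
spans = [
    {"checkout.successful": True,  "checkout.failure_reason": ""},
    {"checkout.successful": False, "checkout.failure_reason": "insufficient_funds"},
    {"checkout.successful": False, "checkout.failure_reason": "inventory_timeout"},
]

# Alert on the business outcome, not the status code.
failures = [s for s in spans if not s["checkout.successful"]]
failure_rate = len(failures) / len(spans)
print(round(failure_rate, 2))  # 0.67 -- fires while HTTP error rate sits at 0%
```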
## The real problem
These aren't "gotchas" you learn once and forget. They're ongoing maintenance.
Every new service needs proper context propagation. Every new engineer needs to know not to tag with `user_id`. Every new feature needs business logic spans.
Most teams don't have someone who's done this across 50+ microservices. They're figuring it out as they go, making expensive mistakes along the way.
If you're scaling on EKS and your observability is more pain than help, I do a 3-day audit that maps exactly where these gaps are in your system. No fluff, just a ranked list of what to fix and how.
Book a 30-min call if you want to talk through it.
— Youn