A Series C team called me because their AI support agent was “randomly” giving wrong answers. Maybe 1 in 20 conversations. They couldn’t reproduce it.

I asked to see a trace of a bad conversation.

They showed me a log line. POST /api/agent 200 OK, 4.2s.

That was it. That was the whole picture they had of a system that made nine model calls, three tool invocations, two retrievals, and a guardrail check before answering. They were debugging an agent with a single status code.

The unit changed and nobody sent a memo

For twenty years the unit of observability was the request. One request in, one response out, one log line, maybe a trace with a few spans. That model held up fine.

The moment you put an LLM agent in production, it breaks.

A single user message now fans out into a tree. The orchestrating call, each model invocation, every retrieval from your vector store, each tool the agent decides to call, a reranker, a guardrail decision, maybe an evaluator scoring the output. Each of those is a span. The trace isn’t a line anymore. It’s a topology.
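To make the tree concrete, here is a minimal sketch of how nested spans can be recorded, using only the Python standard library. The `Span` class, the `span()` helper, and `handle_message()` are hypothetical names I made up for illustration, not any vendor's SDK. In a real system you would more likely reach for OpenTelemetry, whose Python API does the equivalent parent tracking through `tracer.start_as_current_span(...)`.

```python
import contextvars
import time
import uuid
from contextlib import contextmanager
from dataclasses import dataclass, field
from typing import Optional


# Hypothetical span record; field names are illustrative, not a real SDK schema.
@dataclass
class Span:
    kind: str                       # e.g. "AGENT", "LLM", "RETRIEVER", "TOOL"
    name: str
    span_id: str = field(default_factory=lambda: uuid.uuid4().hex[:8])
    parent_id: Optional[str] = None
    attributes: dict = field(default_factory=dict)
    duration_s: float = 0.0
    children: list = field(default_factory=list)


# Tracks "the span we are currently inside" for the running context.
_current = contextvars.ContextVar("current_span", default=None)


@contextmanager
def span(kind, name, **attributes):
    """Open a span as a child of whatever span is currently active."""
    parent = _current.get()
    s = Span(kind, name,
             parent_id=parent.span_id if parent else None,
             attributes=dict(attributes))
    if parent is not None:
        parent.children.append(s)
    token = _current.set(s)
    start = time.perf_counter()
    try:
        yield s
    finally:
        s.duration_s = time.perf_counter() - start
        _current.reset(token)  # restore the parent as the active span


def render(s, depth=0):
    """Return the tree as indented lines, one per span."""
    lines = ["  " * depth + f"{s.kind} {s.name}"]
    for child in s.children:
        lines.extend(render(child, depth + 1))
    return lines


def handle_message():
    """Stand-in for one agent turn; the real work would go inside each block."""
    with span("AGENT", "support-conversation") as root:
        with span("LLM", "plan-next-step"):
            pass
        with span("RETRIEVER", "vector-search", query="refund policy"):
            pass
        with span("TOOL", "lookup_order"):
            pass
    return root


print("\n".join(render(handle_message())))
```

The point of the context variable is that nobody passes a parent around by hand. Whatever the model decides to call at runtime lands under the right node.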

Agent traces can run to thousands of spans and last tens of minutes. Your “request” is a distributed system now, and it’s one you didn’t design. The model decided the control flow at runtime.

What you actually need to see

The span kinds that matter for an agent look nothing like HTTP:

AGENT      support-conversation               (28s, $0.04)
├─ LLM       plan-next-step          gpt-4o   (1.1s, 2,400 tok)
├─ RETRIEVER vector-search          "refund policy"  → 6 chunks
├─ LLM       decide-tool                     (0.8s, 1,900 tok)
├─ TOOL      lookup_order(id=...)            (220ms, ok)
├─ GUARDRAIL pii-filter                       → pass
├─ LLM       compose-answer                  (2.4s, 3,100 tok)
└─ EVALUATOR answer-relevance                 → 0.42  ⚠️

Now the “random” failure isn’t random. The retriever pulled the wrong chunks, so the answer was confidently wrong, and the evaluator score on those conversations was 0.42 instead of 0.9. You can see it. You can filter for it. You can fix the retrieval.

With a 200 OK you can’t see any of that. The HTTP call succeeded. That was never the question.

The two things teams get wrong

First, they instrument the HTTP layer and stop. Their APM wraps the endpoint, sees 4.2 seconds, and calls it traced. But the interesting failures live three layers down, in which chunks got retrieved, which tool got picked, what the model actually saw. If your tracing stops at the request boundary, you’re blind to everything that makes an agent an agent.

Second, they capture mechanics but not quality. Latency and token counts are easy; everybody grabs those. But an agent fails by being wrong, not by being slow. You need eval scores attached to spans: relevance, groundedness, whether the tool call matched intent. Good setups run the evaluator server-side, after the response has gone out, and write the result onto the span itself, so scoring adds no request latency. If you’re only watching latency and status codes, every failure looks like a success.

How to not paint yourself into a corner

Instrument with OpenTelemetry-compatible tracing from day one, even for a “simple” agent. The simple agent becomes a multi-step one in about six weeks, and retrofitting tracing onto a live agent is miserable.

Capture the tree, not the endpoint. Every model call, retrieval, and tool invocation is a span with the inputs that explain the decision: the query, the retrieved chunk IDs, the chosen tool. Sample the full payloads (prompts, completions, chunk text), because they’re huge, but never sample away the structure.
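Here is a rough sketch of what "sample payloads, keep structure" can look like. The span is a plain dict with hypothetical field names (`trace_id`, `chunk_ids`, `payload`), and `keep_payload()` and `export_span()` are illustrative helpers, not a real exporter API. The decision is hashed from the trace ID, so every span in a conversation agrees on whether its payloads are kept.

```python
import hashlib


def keep_payload(trace_id, rate):
    """Deterministic per-trace decision, so all spans in one trace agree."""
    digest = hashlib.sha256(trace_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64  # value in [0, 1)
    return bucket < rate


def export_span(span, rate=0.1):
    """Always export structure and decision inputs; export bulky payloads
    only for traces selected by the sampling rate."""
    out = {k: v for k, v in span.items() if k != "payload"}
    sampled = keep_payload(span["trace_id"], rate)
    out["payload_sampled"] = sampled
    if sampled and "payload" in span:
        out["payload"] = span["payload"]
    return out


retrieval = {
    "trace_id": "conv-1",
    "span_id": "a1",
    "parent_id": "root",
    "kind": "RETRIEVER",
    "query": "refund policy",
    "chunk_ids": ["c1", "c2"],
    "payload": "full chunk text " * 200,  # the expensive part
}

exported = export_span(retrieval, rate=0.1)
print(sorted(exported))
```

Whichever way the sampling decision falls, the exported span still carries its IDs, kind, query, and chunk IDs, so the tree stays intact and the retrieval decision stays explainable.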

Attach quality scores to spans, not to a separate dashboard nobody opens. The whole point is being able to filter “show me the conversations where relevance dropped below 0.6” in one query.
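With scores stored on the spans, that query is a few lines. This is a sketch over an in-memory list of dicts with hypothetical field names (`trace_id`, `kind`, `score`); in practice the same filter would run in whatever query language your tracing backend offers.

```python
def low_relevance_traces(spans, threshold=0.6):
    """Return trace IDs that have at least one EVALUATOR span scoring below threshold."""
    flagged = []
    for s in spans:
        if s["kind"] != "EVALUATOR":
            continue
        score = s.get("score")
        if score is not None and score < threshold and s["trace_id"] not in flagged:
            flagged.append(s["trace_id"])
    return flagged


# Illustrative data only, echoing the scores used in the example trace.
spans = [
    {"trace_id": "conv-1", "kind": "RETRIEVER", "name": "vector-search"},
    {"trace_id": "conv-1", "kind": "EVALUATOR", "name": "answer-relevance", "score": 0.42},
    {"trace_id": "conv-2", "kind": "EVALUATOR", "name": "answer-relevance", "score": 0.9},
]

print(low_relevance_traces(spans))  # ['conv-1']
```

From the flagged trace ID you can pull the sibling spans and look at what the retriever actually returned.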

So where does that leave you

If you’ve shipped an AI feature and your answer to “show me what happened in this conversation” is a status code and a latency number, you don’t have observability. You have a smoke detector that only goes off after the house is gone.

The request log was the right unit for the last era. It’s the wrong one now.


The instrumentation patterns (what to capture, what to sample, and how to avoid blowing up your bill doing it) are in my OTel pitfalls guide.

— Youn