I was looking at their January observability bill. $127k. Up from $34k in October.
Series C company. They’d launched an AI feature in November. Simple RAG chatbot for customer support. Nothing crazy.
The LLM inference costs? About $8k/month. The cost to log all those prompts and completions? $89k.
They weren’t running AI. They were paying to watch it.
The math doesn’t work anymore
Traditional monitoring makes sense. You log a 200-byte JSON response. Maybe a few KB of trace data. Cost per request is negligible.
LLMs break this completely.
A single GPT-4 call with retrieval context might carry an 8,000-token prompt. Roughly 32KB of text. Log the completion alongside it for debugging and you're looking at up to 64KB per request. Add embeddings? Another 6KB each (1,536 floats at 4 bytes), and you might have dozens per request.
Here’s what a modest deployment looks like:
- Daily requests: 100,000
- Avg payload: 64KB (prompt + completion)
- Daily log volume: 6.4GB
- Monthly: ~192GB
- Datadog ingestion at ~$0.10/GB: about $20/month

Twenty dollars. Harmless — and that's the trap. Ingestion is the teaser rate, and one logged event per request is a fantasy. Each user turn fans out into retries, tool calls, and reranking passes, every one re-logging the full accumulated context. Index those events for search, retain them for 30 days, and the real line item lands orders of magnitude above the naive ingestion math — just for LLM traces.
That’s before you add embeddings. Before you add semantic search. Before you add the retrieval context.
Now run that through your Splunk pricing. I’ll wait.
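If you want to run the same napkin math against your own traffic, it's a five-line script. A sketch only: decimal GB, a 30-day month, and `usd_per_gb` is whatever your vendor's effective rate works out to once indexing and retention are included.

```python
def monthly_llm_log_cost(daily_requests, avg_payload_kb, usd_per_gb, fanout=1):
    """Rough monthly log volume and cost for LLM traces.

    fanout = logged events per user request (retries, tool calls,
    retrieval passes). 1 is the best case; real pipelines run higher.
    """
    daily_gb = daily_requests * avg_payload_kb * fanout / 1_000_000
    monthly_gb = daily_gb * 30
    return monthly_gb, monthly_gb * usd_per_gb

# The "modest deployment" above, at the ingestion-only rate:
gb, cost = monthly_llm_log_cost(100_000, 64, 0.10)
print(f"{gb:.0f} GB/month, ${cost:.2f}")  # 192 GB/month, $19.20
```

Then set `fanout=15` and swap in your effective per-GB rate with indexing and retention, and watch the number stop being funny.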
Traditional APM wasn’t built for this
Your existing monitoring assumes:
- Requests complete in milliseconds
- Payloads are small
- Latency spikes are problems
LLMs violate all three.
A 10-second response isn’t an outage. It’s Tuesday. Your latency alerts fire constantly. Teams start ignoring them. I’ve seen this at maybe four or five companies now. Same pattern every time.
A 50KB payload isn’t an anomaly. It’s every single request. Your log volume explodes. Costs go vertical.
One VP of Eng told me they’d set up standard APM tracing on their AI pipeline. Within two weeks, their trace storage was larger than their production database.
That’s not observability. That’s hoarding.
The silent failure mode
Here’s what scares me most.
With traditional services, when things break, you know. Error rates spike. Latencies jump. Alerts fire.
LLMs fail quietly.
The model starts hallucinating more. Response quality degrades. Relevance drops. Your dashboards stay green because technically everything “works.” 200 OK. Latency normal. Error rate zero.
But your users notice. They just stop using the feature. You see engagement drop weeks later and have no idea why.
You weren’t monitoring quality.
You were monitoring HTTP status codes.
What breaks first
The storage bomb. Team logs everything “just in case.” Bills explode. CFO gets involved. Team panics and turns off logging entirely. Now they’re flying blind.
The sampling disaster. Team samples to control costs. 1% of requests. Great for bills. Terrible for debugging. That weird hallucination affecting enterprise customers? Good luck reproducing it with a 1-in-100 chance of having the trace.
The alert swamp. Team applies traditional alerting to LLM endpoints. P95 latency varies by 10x based on prompt length alone. Alert fatigue within days.
What actually works
Stop logging full payloads by default. Log metadata, token counts, model versions, latencies. Store full prompts only for errors and sampled requests.
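A minimal sketch of what that wrapper can look like — the field names, the character-count proxy, and the 1% rate are all placeholders, not a real client:

```python
import random
import time
import uuid

FULL_PAYLOAD_SAMPLE_RATE = 0.01  # tune to your budget

def log_llm_call(model, prompt, completion, latency_ms, error=None):
    """Log cheap metadata always; full payloads only on error or sample."""
    record = {
        "request_id": str(uuid.uuid4()),
        "ts": time.time(),
        "model": model,
        "prompt_chars": len(prompt),          # proxy for token count;
        "completion_chars": len(completion),  # swap in a real tokenizer
        "latency_ms": latency_ms,
        "error": error,
    }
    if error is not None or random.random() < FULL_PAYLOAD_SAMPLE_RATE:
        record["prompt"] = prompt          # the expensive 64KB —
        record["completion"] = completion  # stored only when it earns its keep
    return record  # hand this dict to your logger / OTel exporter
```

A normal call produces a few hundred bytes instead of 64KB; errors and the sampled slice keep full debugging context.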
Build quality metrics that actually matter. Track semantic similarity. Monitor hallucination rates if you can. Compare outputs to ground truth where you have it.
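Semantic similarity against known-good references is cheap to compute once you have embeddings from whatever model you already use. This sketch assumes plain Python lists of floats; the helper names are mine, not a library's:

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b))
    return dot / norm if norm else 0.0

def drift_score(todays_answers, reference_answers):
    """Mean similarity of today's output embeddings to known-good references.

    A score that sinks over days is the quality signal your
    status-code dashboards will never show you.
    """
    sims = [cosine_similarity(a, r) for a, r in zip(todays_answers, reference_answers)]
    return sum(sims) / len(sims)
```

Track the score as a time series and alert on the trend, not on any single request.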
Sample intelligently. Not random 1%. Sample based on behavior. Did they retry? Did they rage-click? Did they abandon the flow? Those are the requests worth storing.
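Behavior-based sampling can be a dozen lines. The signal names below are assumptions — substitute whatever your frontend actually emits:

```python
import zlib

def should_store_trace(event):
    """Decide whether to keep the full trace for this request.

    `event` is a dict of behavioral signals attached to the request.
    """
    if event.get("error"):
        return True  # always keep failures
    if event.get("user_retried"):
        return True  # they asked again — the first answer was bad
    if event.get("rage_clicks", 0) >= 3:
        return True  # frustration signal
    if event.get("abandoned_flow"):
        return True  # they gave up mid-task
    # deterministic ~1% baseline so normal traffic isn't invisible
    return zlib.crc32(str(event.get("request_id", "")).encode()) % 100 == 0
```

The point is the priority order: every interesting trace is kept by construction, and the random floor only covers the boring remainder.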
Set realistic SLOs. A 15-second P99 isn’t a bug for some LLM calls. Know which endpoints can be slow.
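Concretely, that can be as simple as a per-endpoint budget table instead of one global threshold. The paths and numbers here are illustrative, not prescriptions:

```python
# Per-endpoint P99 budgets in seconds.
SLO_P99_SECONDS = {
    "/chat/completions": 15.0,  # long-context generation: slow is normal
    "/embeddings": 1.0,         # should stay fast
    "/search": 0.3,             # plain retrieval, no LLM in the path
}

def slo_breached(endpoint, observed_p99_s):
    """Alert on the budget for *this* endpoint, not a global latency number."""
    return observed_p99_s > SLO_P99_SECONDS.get(endpoint, 1.0)
```

A 12-second P99 on the chat endpoint is healthy; the same number on search is a fire.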
Budget your observability explicitly. If inference costs $10k, you shouldn’t spend $100k to watch it. Set a ratio. Stick to it.
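The check itself is trivial — the hard part is picking the ratio and holding the line. The 1.0 below is a placeholder policy, not a recommendation:

```python
def observability_ratio_ok(inference_usd, observability_usd, max_ratio=1.0):
    """True if you spend at most `max_ratio` dollars watching the system
    per dollar spent actually running it."""
    return observability_usd <= inference_usd * max_ratio

# The company from the opener: $8k inference, $89k logging.
print(observability_ratio_ok(8_000, 89_000))  # False
```

Wire it into the monthly cost review so the ratio is a tracked number, not a surprise.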
The uncomfortable truth
We built 20 years of monitoring best practices around assumptions LLMs violate.
The tools haven’t caught up. The pricing models haven’t caught up. Most teams are figuring this out the hard way.
If you’re running AI in production, this will hit you. Maybe not this quarter. But soon. Your observability bill will cross a line that makes someone in finance very unhappy.
Better to fix it now than explain it later.
More on getting observability right without burning money in my guide on reliability costs.
— Youn