A Series C founder messaged me at the end of a month. “Our OpenAI bill is $61k. It was $9k last month. We didn’t ship anything new.”
They had shipped something. Three weeks earlier a small change let one agent call another agent under certain conditions. Most of the time it was fine. Sometimes the two agents ping-ponged, each one’s output triggering the other, until a timeout cut it off thirty minutes and millions of tokens later.
Nobody noticed, because nobody was watching cost in real time. They were watching it the way everyone watches it now. Monthly, on the invoice, after the money’s gone.
Token spend behaves like nothing else on your bill
Your cloud costs are mostly boring and predictable. Compute scales with traffic. Storage grows slowly. You can forecast them.
LLM token spend is volatile in a way that breaks those instincts.
The cost of a single user action isn’t fixed. It depends on how many times the model retried, how big the context window grew over a long conversation, how many reasoning steps a multi-step agent took, and how much overhead your framework wrapped around each call. The same feature can cost a tenth of a cent for one user and forty cents for the next.
That’s why estimates come out wildly off. People budget “average tokens times price” and reality laughs, because the average is meaningless when the tail is this fat. Retries, context growth, and agent loops all live in the tail, and the tail is where the money goes.
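To make the fat tail concrete, here is a small sketch with made-up numbers. It assumes 100 requests to one feature, 97 of them at a tenth of a cent and 3 at forty cents (the two figures from above; the 97/3 split is purely illustrative).

```python
from statistics import mean, median

# Hypothetical per-request costs in dollars: 97 cheap requests and 3 that
# hit retries, a bloated context, or an agent loop.
costs = [0.001] * 97 + [0.40] * 3

total = sum(costs)
# Share of total spend that comes from the three expensive requests.
tail_share = sum(c for c in costs if c >= 0.40) / total

print(f"median cost per request: ${median(costs):.4f}")
print(f"mean cost per request:   ${mean(costs):.4f}")
print(f"share of spend from the 3 most expensive requests: {tail_share:.0%}")
```

In this toy data the typical request costs a tenth of a cent, yet three requests account for over 90% of the spend. A budget built on the typical request misses almost all of the bill.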
The result is always the same. Teams don’t realize they overspent until the invoice arrives. By then it’s a closed month and a very awkward conversation with finance.
Why your existing monitoring won’t catch it
Your observability is built to alert on errors and latency. A runaway agent loop usually throws neither.
Each individual model call succeeds. 200 OK. Latency per call is normal. The loop is just a lot of normal calls, fast, in a row. Every signal your monitoring watches stays green while the meter spins.
It’s the green-dashboard problem again, wearing a new outfit. Everything “works.” You’re just hemorrhaging money, and the only system that knows is billing, which reports once a month.
Put the meter on the span
You already need to trace your agents to debug them. Cost rides along for almost free if you do it right.
Attach token counts and dollar cost to every model span. Input tokens, output tokens, model, and computed cost as span attributes. Now every trace carries its own price tag, “this conversation cost 3.2 cents,” and you can sum, group, and rank by it.
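A minimal sketch of what that can look like, assuming a per-million-token price table. The model names, prices, and attribute keys below are placeholders I made up for illustration, not any provider's real rates or any standard's naming.

```python
# Hypothetical prices in dollars per 1M tokens. Replace with your
# provider's current rates; these numbers are illustrative only.
PRICES_PER_MILLION = {
    "example-small": {"input": 0.50, "output": 1.50},
    "example-large": {"input": 5.00, "output": 15.00},
}


def call_cost_usd(model, input_tokens, output_tokens):
    """Compute the dollar cost of one model call from its token counts."""
    price = PRICES_PER_MILLION[model]
    return (input_tokens * price["input"] + output_tokens * price["output"]) / 1_000_000


def cost_attributes(model, input_tokens, output_tokens):
    """Build the attributes to attach to a model span (keys are illustrative)."""
    return {
        "llm.model": model,
        "llm.input_tokens": input_tokens,
        "llm.output_tokens": output_tokens,
        "llm.cost_usd": call_cost_usd(model, input_tokens, output_tokens),
    }


attrs = cost_attributes("example-large", input_tokens=4_000, output_tokens=800)
print(attrs)
```

With a tracing library you would copy each key onto the active span (in OpenTelemetry's Python API, for example, via `span.set_attribute(key, value)`). Computing the cost at call time matters because the price table can change later, and you want the trace to record what the call cost when it ran.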
Roll cost up to the trace and the feature. Per-call cost is noise. Per-conversation and per-feature cost is the number that matters. When the bill jumps you want to answer “which feature, which user segment, which code path” in one query, not in a two-day forensic dig through logs.
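The roll-up itself is just a group-by. Here is a sketch over exported span records, assuming each one carries a `trace_id`, a `feature` tag, and a `cost_usd` value. Those field names and the sample data are hypothetical; in practice this would be a query in your tracing backend.

```python
from collections import defaultdict

# Hypothetical exported span records; field names are illustrative.
spans = [
    {"trace_id": "t1", "feature": "summarize", "cost_usd": 0.004},
    {"trace_id": "t1", "feature": "summarize", "cost_usd": 0.006},
    {"trace_id": "t2", "feature": "agent_chat", "cost_usd": 0.030},
    {"trace_id": "t2", "feature": "agent_chat", "cost_usd": 0.045},
    {"trace_id": "t3", "feature": "agent_chat", "cost_usd": 0.020},
]


def rollup(spans, key):
    """Sum cost_usd per value of `key`, most expensive first."""
    totals = defaultdict(float)
    for span in spans:
        totals[span[key]] += span["cost_usd"]
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)


by_trace = rollup(spans, "trace_id")
by_feature = rollup(spans, "feature")

print("cost by trace:  ", by_trace)
print("cost by feature:", by_feature)
```

The same function answers "which user segment" or "which code path" if you tag spans with those dimensions when you emit them.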
Alert on cost, not just errors. Set thresholds on tokens-per-trace and spend-per-hour. A trace that burns 10x the median token count is a problem whether or not it errored. A single conversation that runs thousands of spans for thirty minutes is the loop that’s about to cost you $40k. Catch it live, not in arrears.
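Both thresholds can be sketched in a few lines. This assumes you can pull tokens-per-trace for a recent window and a running spend total for the last hour; the function names, sample data, and the $50 threshold are hypothetical, and the 10x multiple is the one from the paragraph above.

```python
from statistics import median


def flag_expensive_traces(tokens_by_trace, multiple=10):
    """Return trace ids whose token count exceeds `multiple` x the median."""
    baseline = median(tokens_by_trace.values())
    return sorted(
        trace_id
        for trace_id, tokens in tokens_by_trace.items()
        if tokens > multiple * baseline
    )


def hourly_spend_exceeded(spend_usd_last_hour, threshold_usd):
    """True when spend over the last hour is above the alert threshold."""
    return spend_usd_last_hour > threshold_usd


# Hypothetical recent window: four ordinary traces and one runaway.
recent = {"t1": 1_200, "t2": 900, "t3": 1_500, "t4": 1_100, "t5": 48_000}

flagged = flag_expensive_traces(recent)
print("traces over 10x median:", flagged)
print("hourly spend alert:", hourly_spend_exceeded(72.50, threshold_usd=50.00))
```

Using the median rather than the mean as the baseline is deliberate: a runaway trace drags the mean up with it, which makes it harder for that same trace to look abnormal.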
And put a hard ceiling on agent loops. Max steps, max tokens, max wall-clock per agent run, enforced in code. Observability tells you it happened. A limit stops it from happening. You want both.
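One way to enforce that ceiling is a budget object the agent loop must consult before every step. This is a sketch, not a drop-in: `RunBudget`, `BudgetExceeded`, `run_agent`, and the `step_fn` contract (returns tokens used and a done flag) are all names I invented for illustration.

```python
import time


class BudgetExceeded(RuntimeError):
    """Raised when an agent run hits one of its hard limits."""


class RunBudget:
    def __init__(self, max_steps, max_tokens, max_seconds, clock=time.monotonic):
        self.max_steps = max_steps
        self.max_tokens = max_tokens
        self.max_seconds = max_seconds
        self.clock = clock
        self.started = clock()
        self.steps = 0
        self.tokens = 0

    def check(self):
        """Raise if any limit has been reached; call before each step."""
        if self.steps >= self.max_steps:
            raise BudgetExceeded(f"step limit reached ({self.max_steps})")
        if self.tokens >= self.max_tokens:
            raise BudgetExceeded(f"token limit reached ({self.max_tokens})")
        if self.clock() - self.started >= self.max_seconds:
            raise BudgetExceeded(f"time limit reached ({self.max_seconds}s)")

    def record(self, tokens_used):
        """Account for one completed step."""
        self.steps += 1
        self.tokens += tokens_used


def run_agent(step_fn, budget):
    """Run step_fn until it reports done, refusing to start a step past budget."""
    while True:
        budget.check()
        tokens_used, done = step_fn()
        budget.record(tokens_used)
        if done:
            return budget.steps


# A stand-in for two agents ping-ponging: every step uses 500 tokens and
# never reports done.
budget = RunBudget(max_steps=8, max_tokens=100_000, max_seconds=60)
try:
    run_agent(lambda: (500, False), budget)
except BudgetExceeded as exc:
    print(f"stopped: {exc} after {budget.steps} steps, {budget.tokens} tokens")
```

Because the check runs before each step, a step already in flight can push the token count past the limit before the run stops. If that overshoot matters, also cap the output tokens of each individual call.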
The math that should bother you
For a growing number of teams, the model’s inference cost isn’t even the scary part. The scary part is that they can’t see it until it’s spent.
You instrument compute. You instrument storage. You’d never run prod without watching CPU. Token spend is now a comparable line item, and most teams are running it completely unmonitored, reconciled monthly like it’s a utility bill.
Treat tokens like the volatile real-time cost they are. Put the meter on the span. Because the alternative is finding out from finance, and finance is not a great monitoring tool.
My reliability costs breakdown covers the cost discipline that actually holds: how to watch spend without spending a fortune on the watching.
— Youn