Got a Slack message last Tuesday. “Finance is asking about our Datadog spend. Can you join a call tomorrow?”
I’ve had this conversation maybe 30 times in the past year. It’s always the same. Engineering thinks observability is critical. Finance sees a line item that grew 400% in 18 months. Nobody knows how to bridge that gap.
Here’s how I approach it.
Start with what actually matters
Before the meeting, I ask teams one question: “What decisions did you make last month based on this data?”
Usually there’s a long pause.
Most telemetry exists because someone thought it might be useful. Not because it drives actual decisions. That’s the real problem.
Make a list of your data and categorize it:
TIER 1: Used for on-call, pages, immediate action
TIER 2: Used for weekly reviews, capacity planning
TIER 3: "We might need it someday"
TIER 4: Nobody remembers why we collect this
Be honest. Tiers 3 and 4 are usually 70% of the bill.
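To make the conversation with finance concrete, I like putting rough dollar figures against each tier. A minimal sketch with invented numbers (not real pricing):

```python
# Hypothetical monthly ingest cost per tier -- invented numbers, not real pricing.
tier_cost = {
    "tier1_oncall": 4_000,
    "tier2_reviews": 5_000,
    "tier3_someday": 14_000,
    "tier4_unknown": 7_000,
}

total = sum(tier_cost.values())
cut_candidates = tier_cost["tier3_someday"] + tier_cost["tier4_unknown"]
print(f"Tiers 3+4: ${cut_candidates:,} of ${total:,} ({cut_candidates / total:.0%})")
# -> Tiers 3+4: $21,000 of $30,000 (70%)
```

Even rough numbers like these change the tone of the meeting: you walk in with a cut proposal instead of a defense.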
Tie telemetry to business value
Finance people aren’t dumb. They just need translation.
Don’t say: “We need distributed tracing for debugging.”
Say: “Last month tracing helped us find a bug that was causing 3% of checkouts to fail. That’s roughly $40k in lost revenue per week. We fixed it in 2 hours instead of 2 days.”
Now they get it. Observability isn’t a cost center. It’s insurance plus a debugging tool. But you have to prove it.
Keep a doc of incidents where observability actually helped. Date, impact, how telemetry contributed. Pull it out in budget conversations.
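The doc doesn't need to be fancy. One sketch of an entry, reusing the checkout example from above (the field names are my own):

```python
# One entry in the "observability saved us" doc; values echo the checkout example.
incident = {
    "date": "2024-01-15",  # hypothetical date
    "what_happened": "3% of checkouts failing",
    "impact": "~$40k/week in lost revenue",
    "how_telemetry_helped": "tracing pinpointed the failing span; fixed in 2h, not 2 days",
}

for field, value in incident.items():
    print(f"{field}: {value}")
```

Three or four of these entries, with dollar figures attached, are worth more in a budget meeting than any dashboard screenshot.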
Intelligent sampling is your friend
You don’t need 100% of your traces. You need 100% of interesting traces.
Head-based sampling is lazy: it decides at the root span, before you know whether the request will error out or crawl. Tail-based sampling waits until the trace is complete and decides with full information. That’s where it’s at.
    # OTel Collector config - tail sampling
    processors:
      tail_sampling:
        policies:
          - name: errors-always
            type: status_code
            status_code: {status_codes: [ERROR]}
          - name: slow-requests
            type: latency
            latency: {threshold_ms: 500}
          - name: sample-rest
            type: probabilistic
            probabilistic: {sampling_percentage: 5}
Errors and slow requests: keep everything. Normal traffic: sample 5%. You cut roughly 90% of trace volume while keeping what matters.
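The ~90% figure is easy to sanity-check. Assuming, say, 2% of traffic errors and 3% is slow (your mix will differ):

```python
# Back-of-the-envelope volume math for the tail-sampling policies above.
# Traffic mix is an assumption: 2% errors, 3% slow, the rest normal.
errors, slow = 0.02, 0.03
normal = 1.0 - errors - slow
sample_rate = 0.05  # probabilistic policy applied to normal traffic

kept = errors + slow + normal * sample_rate  # errors and slow kept at 100%
print(f"kept {kept:.1%} of traces, cut {1 - kept:.1%}")
```

That comes out to keeping under 10% of traces. If your error rate is higher, the savings shrink, which is exactly the behavior you want.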
Retention policies matter
Not all data needs to live forever.
- Traces: 7 days is usually enough. Extend for errors.
- Metrics: 30 days at full resolution, downsample after.
- Logs: depends, but DEBUG logs should not hit production. Ever.
I worked with a team storing 90 days of all traces. Asked when they last looked at a trace older than a week. Blank stares.
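On the DEBUG point: the cheapest log is one you never emit. A minimal Python sketch of gating verbosity on environment, where the ENV variable and logger name are my own assumptions:

```python
import logging
import os

# Gate verbosity on environment so DEBUG records never leave a dev box.
# "ENV" and the "checkout" logger name are hypothetical.
level = logging.DEBUG if os.getenv("ENV") == "dev" else logging.INFO
logging.basicConfig(level=level)
log = logging.getLogger("checkout")

log.debug("cart contents: ...")  # dropped unless ENV=dev
log.info("order placed")         # always emitted
```

Dropping DEBUG at the source beats filtering it in the pipeline: you save the network and ingest cost, not just the storage.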
The awkward truth
Sometimes the answer is “we’re overspending and need to cut.” That’s okay. It’s better to have the conversation proactively than wait for a mandate from above.
Pick your battles. Protect the data that actually saves you during incidents. Let go of the vanity metrics nobody looks at.
If you want to dig deeper into sampling configs and common OTel mistakes, check out my technical guide on OTel pitfalls.
— Youn