OTel profiling is the fourth signal now

For years observability meant three things. Traces, metrics, logs. The “three pillars.” Every vendor diagram, every conference talk.

There are four now.

OpenTelemetry made continuous profiling an official signal. It hit public Alpha back in March, moved to release candidate, and is targeting GA around Q3. The question landing in my inbox from engineering leaders is simple. Do we need this, and do we need it now?

Here’s the honest answer.

What profiling actually is

Traces tell you which service is slow. Metrics tell you that something is slow. Logs tell you what happened.

None of them tell you which line of code is burning the CPU.

Continuous profiling does. It samples what your processes are actually executing: which functions are on the CPU, and which are allocating memory. It does this continuously, in production. Not a one-off flame graph you capture during an incident. An always-on recording you can rewind.

The OTel implementation is built on opentelemetry-ebpf-profiler, the eBPF agent Elastic donated from its Universal Profiling product. It runs as a Collector receiver, profiles the whole system on Linux with essentially zero instrumentation, and does automatic symbolization for Go. No code changes, no SDK, no redeploy.

That last part is the big deal. Whole-fleet profiling used to mean either an expensive vendor or hand-rolling perf tooling. Now it’s a Collector component.

Where it earns its keep

Profiling pays off in two specific situations, and you probably have at least one.

The CPU bill you can’t explain. You’re spending real money on compute and nobody can say which code is responsible. Profiling points straight at the hot functions. Teams routinely find a single inefficient serialization path or a regex in a loop eating 20-30% of a service’s CPU. That’s a direct cloud-cost line item, found in an afternoon.
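To make the “regex in a loop” case concrete, here is a minimal Go sketch of the pattern. The function names and input lines are made up for illustration. The slow version recompiles the pattern on every iteration, which is the kind of frame that tends to show up wide in a flame graph. The fast version compiles once and reuses it.

```go
package main

import (
	"fmt"
	"regexp"
)

// countMatchesSlow recompiles the pattern for every line. Compiling a regexp
// is much more expensive than matching with one, so the compile call is
// what would dominate the CPU samples for this function.
func countMatchesSlow(lines []string, pattern string) int {
	n := 0
	for _, line := range lines {
		re := regexp.MustCompile(pattern) // compiled once per iteration
		if re.MatchString(line) {
			n++
		}
	}
	return n
}

// countMatchesFast compiles the pattern once and reuses it for every line.
func countMatchesFast(lines []string, pattern string) int {
	re := regexp.MustCompile(pattern) // compiled once per call
	n := 0
	for _, line := range lines {
		if re.MatchString(line) {
			n++
		}
	}
	return n
}

func main() {
	// Hypothetical input, just to show both versions agree.
	lines := []string{"GET /users/42", "POST /orders", "GET /users/7"}
	pattern := `^GET /users/\d+$`
	fmt.Println(countMatchesSlow(lines, pattern), countMatchesFast(lines, pattern)) // prints: 2 2
}
```

Both functions return the same answer. Only the CPU they burn differs, and that difference never shows up in a trace or a log line.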

The latency a trace can’t explain. Your trace says a span took 800ms inside one service. Inside. No downstream call. The trace is a black box inside that span, and profiling is how you see in. Match a slow trace to a flame graph at the same timestamp and the mystery 800ms turns into “we’re spending it in JSON parsing.”
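The matching step is simple in principle: keep only the stack samples that fall inside the span’s time window, then count which function was executing. Here is a rough Go sketch of that idea. The Sample type, the millisecond timestamps, and the function names are all hypothetical. The real OTel profiles data model is richer, and in practice you would also filter by host and process.

```go
package main

import (
	"fmt"
	"sort"
)

// Sample is a hypothetical, simplified stack sample: the time it was taken
// (milliseconds since some epoch) and the call stack, outermost frame first.
type Sample struct {
	AtMs  int64
	Stack []string
}

// FuncCount pairs a function name with how many samples it was the leaf of.
type FuncCount struct {
	Func  string
	Count int
}

// hotFunctions counts the leaf (currently executing) frame of every sample
// taken in [startMs, endMs] and returns the counts, largest first.
// Ties are broken by function name so the output is deterministic.
func hotFunctions(samples []Sample, startMs, endMs int64) []FuncCount {
	counts := map[string]int{}
	for _, s := range samples {
		if s.AtMs < startMs || s.AtMs > endMs || len(s.Stack) == 0 {
			continue // outside the span window, or nothing to attribute
		}
		counts[s.Stack[len(s.Stack)-1]]++
	}
	out := make([]FuncCount, 0, len(counts))
	for f, c := range counts {
		out = append(out, FuncCount{Func: f, Count: c})
	}
	sort.Slice(out, func(i, j int) bool {
		if out[i].Count != out[j].Count {
			return out[i].Count > out[j].Count
		}
		return out[i].Func < out[j].Func
	})
	return out
}

func main() {
	// Made-up samples around a span that runs from t=0ms to t=800ms.
	samples := []Sample{
		{AtMs: -50, Stack: []string{"main", "serve", "readBody"}}, // before the span
		{AtMs: 100, Stack: []string{"main", "handle", "parseJSON"}},
		{AtMs: 300, Stack: []string{"main", "handle", "parseJSON"}},
		{AtMs: 500, Stack: []string{"main", "handle", "parseJSON"}},
		{AtMs: 700, Stack: []string{"main", "handle", "writeResponse"}},
		{AtMs: 900, Stack: []string{"main", "serve", "readBody"}}, // after the span
	}
	for _, fc := range hotFunctions(samples, 0, 800) {
		fmt.Printf("%s %d\n", fc.Func, fc.Count)
	}
}
```

With these made-up samples, parseJSON owns three of the four samples inside the window. That is the black box opened.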

If you’ve ever said “the trace shows the time is in the service but we don’t know where,” profiling is the missing piece.

The honest case for waiting

It’s Alpha-to-RC, targeting GA in Q3. That means a few things.

APIs and semantic conventions can still shift. If you build heavy automation on it today, expect some churn.

It’s another data type, another Collector component to operate, and another thing that produces volume you pay to store. You’re already fighting your observability bill, so don’t bolt on a fourth signal you won’t look at.

And it’s strongest on Linux, with Go symbolization the most mature. Mixed-runtime fleets get more uneven results today.

If you’re a 30-engineer shop whose traces and metrics are still a mess, fix those first. Profiling is the fourth thing to get right, not the first. A fourth signal on top of three broken ones is just more noise with a nicer flame graph.

What I’d actually do

Don’t put it on the critical path yet. Do run a time-boxed spike.

Stand up the eBPF profiler against your three most expensive or most latency-sensitive services. Leave it for two weeks. Look at where CPU actually goes. Almost every team finds at least one embarrassing hot path worth real money, and that one finding usually pays for the experiment many times over.
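The “pays for the experiment” math is back-of-the-envelope. A rough Go sketch, with hypothetical numbers, and assuming compute cost scales linearly with CPU (which it only roughly does):

```go
package main

import "fmt"

// monthlySavings estimates what removing a hot path is worth: the service's
// monthly compute spend times the share of CPU samples attributed to that
// path. It assumes cost scales linearly with CPU, which is a simplification.
func monthlySavings(monthlyComputeUSD, hotPathCPUShare float64) float64 {
	return monthlyComputeUSD * hotPathCPUShare
}

func main() {
	// Hypothetical: a $12,000/month service with 25% of samples in one path.
	fmt.Printf("$%.0f/month\n", monthlySavings(12000, 0.25)) // prints: $3000/month
}
```

Treat the result as an upper bound. You rarely remove a hot path entirely, and savings only land if you actually scale the service down afterward.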

Then wait for GA before you make it a standing part of your stack. Adopt the value now, commit to the tooling once it’s stable.

The three pillars were always an incomplete picture. The fourth one is finally mature enough to take seriously. Just take it seriously in the right order.

If your traces and metrics aren’t solid yet, start there. I wrote up the most common ways teams get OpenTelemetry wrong in my OTel pitfalls guide.

— Youn