<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom">
  <channel>
    <title>Ynden — Observability &amp; Reliability</title>
    <link>https://ynden.tech/blog/</link>
    <atom:link href="https://ynden.tech/rss.xml" rel="self" type="application/rss+xml" />
    <description>Notes on observability, reliability, and the stuff that breaks at 3am.</description>
    <language>en-us</language>
    <item>
      <title>AI SRE is a Gartner category now</title>
      <link>https://ynden.tech/blog/ai-sre-is-a-gartner-category-now/</link>
      <guid>https://ynden.tech/blog/ai-sre-is-a-gartner-category-now/</guid>
      <description>Gartner published its first AI SRE Market Guide in January. The tech is real. The way most teams plan to deploy it will turn a blip into an outage.</description>
      <pubDate>Thu, 18 Jun 2026 00:00:00 GMT</pubDate>
    </item>
    <item>
      <title>Token spend is a cloud line item now</title>
      <link>https://ynden.tech/blog/token-spend-is-a-cloud-line-item/</link>
      <guid>https://ynden.tech/blog/token-spend-is-a-cloud-line-item/</guid>
      <description>An agent stuck in a retry loop can burn $40k overnight, and you&apos;ll find out from the invoice. Put the meter on the span.</description>
      <pubDate>Fri, 12 Jun 2026 00:00:00 GMT</pubDate>
    </item>
    <item>
      <title>OTel won. Now the hard part.</title>
      <link>https://ynden.tech/blog/otel-won-now-the-hard-part/</link>
      <guid>https://ynden.tech/blog/otel-won-now-the-hard-part/</guid>
      <description>OpenTelemetry graduated from the CNCF in May. The standards war is over. Operating the Collector is the new problem.</description>
      <pubDate>Fri, 05 Jun 2026 00:00:00 GMT</pubDate>
    </item>
    <item>
      <title>eBPF for breadth, SDKs for depth</title>
      <link>https://ynden.tech/blog/ebpf-for-breadth-sdks-for-depth/</link>
      <guid>https://ynden.tech/blog/ebpf-for-breadth-sdks-for-depth/</guid>
      <description>Beyla became an official OpenTelemetry project. Zero-touch instrumentation is real now, but it doesn&apos;t replace your SDKs, and the vendor who built it says so.</description>
      <pubDate>Thu, 21 May 2026 00:00:00 GMT</pubDate>
    </item>
    <item>
      <title>AI amplifies your worst pipeline</title>
      <link>https://ynden.tech/blog/ai-amplifies-your-worst-pipeline/</link>
      <guid>https://ynden.tech/blog/ai-amplifies-your-worst-pipeline/</guid>
      <description>The DORA 2025 data is blunt about it. AI raised your throughput and your change-failure risk at the same time.</description>
      <pubDate>Thu, 07 May 2026 00:00:00 GMT</pubDate>
    </item>
    <item>
      <title>Store everything, aggregate nothing</title>
      <link>https://ynden.tech/blog/store-everything-aggregate-nothing/</link>
      <guid>https://ynden.tech/blog/store-everything-aggregate-nothing/</guid>
      <description>Observability 2.0 isn&apos;t a vibe. ClickHouse runs a 100PB stack on it and published the CPU math to prove it.</description>
      <pubDate>Wed, 22 Apr 2026 00:00:00 GMT</pubDate>
    </item>
    <item>
      <title>OTel profiling is the fourth signal now</title>
      <link>https://ynden.tech/blog/otel-profiling-fourth-signal/</link>
      <guid>https://ynden.tech/blog/otel-profiling-fourth-signal/</guid>
      <description>Continuous profiling just became an official OpenTelemetry signal. The honest buy-or-wait call for a Series B-D team.</description>
      <pubDate>Wed, 08 Apr 2026 00:00:00 GMT</pubDate>
    </item>
    <item>
      <title>You&apos;re not logging requests anymore. You&apos;re tracing agents.</title>
      <link>https://ynden.tech/blog/youre-not-logging-requests-youre-tracing-agents/</link>
      <guid>https://ynden.tech/blog/youre-not-logging-requests-youre-tracing-agents/</guid>
      <description>Ship an LLM agent and the request log stops being the unit of observability. The trace tree is. Most teams find out the hard way.</description>
      <pubDate>Tue, 24 Mar 2026 00:00:00 GMT</pubDate>
    </item>
    <item>
      <title>Config took down the internet twice in 2025</title>
      <link>https://ynden.tech/blog/cloudflare-aws-outages-config-lessons/</link>
      <guid>https://ynden.tech/blog/cloudflare-aws-outages-config-lessons/</guid>
      <description>Cloudflare and AWS didn&apos;t fall over from too much traffic. A bad config and an empty DNS record did it.</description>
      <pubDate>Fri, 13 Mar 2026 00:00:00 GMT</pubDate>
    </item>
    <item>
      <title>Your Datadog logs are billed twice</title>
      <link>https://ynden.tech/blog/datadog-logs-billed-twice/</link>
      <guid>https://ynden.tech/blog/datadog-logs-billed-twice/</guid>
      <description>You pay to send the log. Then you pay again to make it searchable. That&apos;s where the bill shock comes from.</description>
      <pubDate>Wed, 04 Mar 2026 00:00:00 GMT</pubDate>
    </item>
    <item>
      <title>Why your platform team shipped nothing this quarter</title>
      <link>https://ynden.tech/blog/why-your-platform-team-shipped-nothing/</link>
      <guid>https://ynden.tech/blog/why-your-platform-team-shipped-nothing/</guid>
      <description>Fourteen months, three engineers, a 400-line golden path two teams use. Platforms fail on politics, not tech, when you optimize for the 20% edge cases.</description>
      <pubDate>Tue, 17 Feb 2026 00:00:00 GMT</pubDate>
    </item>
    <item>
      <title>Your next outage will cost more than your Series B</title>
      <link>https://ynden.tech/blog/outage-costs-more-than-series-b/</link>
      <guid>https://ynden.tech/blog/outage-costs-more-than-series-b/</guid>
      <description>A 4-hour outage knocked $8M off a term sheet because nobody had the number. The hidden downtime costs that never show up in your MTTR report.</description>
      <pubDate>Mon, 16 Feb 2026 00:00:00 GMT</pubDate>
    </item>
    <item>
      <title>Your LLM costs more to observe than to run</title>
      <link>https://ynden.tech/blog/llm-costs-more-to-observe/</link>
      <guid>https://ynden.tech/blog/llm-costs-more-to-observe/</guid>
      <description>One team paid $8k to run a RAG chatbot and $89k to log it. AI observability breaks traditional APM. Here&apos;s how to instrument LLMs without going broke.</description>
      <pubDate>Sun, 15 Feb 2026 00:00:00 GMT</pubDate>
    </item>
    <item>
      <title>Why your team can&apos;t kill Datadog</title>
      <link>https://ynden.tech/blog/why-your-team-cant-kill-datadog/</link>
      <guid>https://ynden.tech/blog/why-your-team-cant-kill-datadog/</guid>
      <description>One team tried to kill Datadog and ended up with Datadog, Grafana Cloud, and New Relic, 40% over budget. Tool sprawl is a leadership problem, not a tech one.</description>
      <pubDate>Sat, 14 Feb 2026 00:00:00 GMT</pubDate>
    </item>
    <item>
      <title>Your SREs aren&apos;t burned out. They&apos;re gone.</title>
      <link>https://ynden.tech/blog/your-sres-arent-burned-out-theyre-gone/</link>
      <guid>https://ynden.tech/blog/your-sres-arent-burned-out-theyre-gone/</guid>
      <description>On-call doesn&apos;t just burn SREs out, it pushes them into PM roles for a 40% bump and no pages. Each departure costs $150k+. Fixing on-call costs less.</description>
      <pubDate>Fri, 13 Feb 2026 00:00:00 GMT</pubDate>
    </item>
    <item>
      <title>Your SRE search is taking too long</title>
      <link>https://ynden.tech/blog/sre-search-taking-too-long/</link>
      <guid>https://ynden.tech/blog/sre-search-taking-too-long/</guid>
      <description>Hiring a senior SRE who can read code takes 3-6 months and $200k+. Here&apos;s what to do about the 90-day gap while you search for that unicorn.</description>
      <pubDate>Thu, 12 Feb 2026 00:00:00 GMT</pubDate>
    </item>
    <item>
      <title>Your observability bill is lying to you</title>
      <link>https://ynden.tech/blog/observability-bill-lying-to-you/</link>
      <guid>https://ynden.tech/blog/observability-bill-lying-to-you/</guid>
      <description>One user_id tag took a Datadog bill from $2k to $47k. High-cardinality tags multiply your metric series. Here&apos;s how to set an observability budget.</description>
      <pubDate>Tue, 10 Feb 2026 00:00:00 GMT</pubDate>
    </item>
    <item>
      <title>OTel broke our prod. Again.</title>
      <link>https://ynden.tech/blog/otel-broke-our-prod-again/</link>
      <guid>https://ynden.tech/blog/otel-broke-our-prod-again/</guid>
      <description>An OTel SDK bump from 1.28 to 2.1 caused a 3-hour outage. Why OpenTelemetry upgrades silently break prod and how to test the upgrade path first.</description>
      <pubDate>Sun, 08 Feb 2026 00:00:00 GMT</pubDate>
    </item>
    <item>
      <title>The observability budget conversation nobody wants to have</title>
      <link>https://ynden.tech/blog/observability-budget-conversation/</link>
      <guid>https://ynden.tech/blog/observability-budget-conversation/</guid>
      <description>Finance flagged your Datadog spend? Tie telemetry to revenue, use tail-based sampling, and cut retention. How to defend the budget that actually matters.</description>
      <pubDate>Thu, 05 Feb 2026 00:00:00 GMT</pubDate>
    </item>
    <item>
      <title>What to actually look for in an SRE hire</title>
      <link>https://ynden.tech/blog/what-to-look-for-sre-hire/</link>
      <guid>https://ynden.tech/blog/what-to-look-for-sre-hire/</guid>
      <description>Most SRE postings are 23-bullet wishlists nobody meets. The hire that matters can read your app code and instrument OTel, not just move YAML.</description>
      <pubDate>Tue, 03 Feb 2026 00:00:00 GMT</pubDate>
    </item>
    <item>
      <title>The 3 AM page that shouldn&apos;t have happened</title>
      <link>https://ynden.tech/blog/3am-page-that-shouldnt-have-happened/</link>
      <guid>https://ynden.tech/blog/3am-page-that-shouldnt-have-happened/</guid>
      <description>Most alerts page you about CPU and memory, not whether checkout works. Switch to SLO-based alerting and stop losing sleep over garbage.</description>
      <pubDate>Sun, 01 Feb 2026 00:00:00 GMT</pubDate>
    </item>
    <item>
      <title>The OTel usability gap nobody fixed</title>
      <link>https://ynden.tech/blog/otel-usability-gap-nobody-fixed/</link>
      <guid>https://ynden.tech/blog/otel-usability-gap-nobody-fixed/</guid>
      <description>A solid engineer spent three weeks adding tracing and still failed. OpenTelemetry nailed the hard tech but the usability is rough. Start with one path.</description>
      <pubDate>Wed, 28 Jan 2026 00:00:00 GMT</pubDate>
    </item>
    <item>
      <title>Your best engineers are stuck firefighting</title>
      <link>https://ynden.tech/blog/your-best-engineers-are-stuck-firefighting/</link>
      <guid>https://ynden.tech/blog/your-best-engineers-are-stuck-firefighting/</guid>
      <description>Your senior engineers burn 30% of their time on incidents because the system needs tribal knowledge to debug. That&apos;s $50k-$75k each, in salary alone.</description>
      <pubDate>Sun, 25 Jan 2026 00:00:00 GMT</pubDate>
    </item>
    <item>
      <title>The hero problem</title>
      <link>https://ynden.tech/blog/the-hero-problem/</link>
      <guid>https://ynden.tech/blog/the-hero-problem/</guid>
      <description>Your 10x engineer is a single point of failure. When the one who holds the system in their head leaves, 20-minute fixes become 4-hour outages.</description>
      <pubDate>Thu, 22 Jan 2026 00:00:00 GMT</pubDate>
    </item>
    <item>
      <title>Five dashboards to debug one request</title>
      <link>https://ynden.tech/blog/five-dashboards-to-debug-one-request/</link>
      <guid>https://ynden.tech/blog/five-dashboards-to-debug-one-request/</guid>
      <description>Multi-cloud observability means jumping between CloudWatch, Azure Monitor, and GCP to chase one request. Here&apos;s how to get a real single pane of glass.</description>
      <pubDate>Tue, 20 Jan 2026 00:00:00 GMT</pubDate>
    </item>
    <item>
      <title>The war room is burning money</title>
      <link>https://ynden.tech/blog/the-war-room-is-burning-money/</link>
      <guid>https://ynden.tech/blog/the-war-room-is-burning-money/</guid>
      <description>Eight senior engineers in a war room for four hours runs $4,500 a pop, and most teams do it weekly. Distributed tracing turns that into 30 seconds.</description>
      <pubDate>Sun, 18 Jan 2026 00:00:00 GMT</pubDate>
    </item>
    <item>
      <title>Alert fatigue killed our on-call</title>
      <link>https://ynden.tech/blog/alert-fatigue-killed-our-oncall/</link>
      <guid>https://ynden.tech/blog/alert-fatigue-killed-our-oncall/</guid>
      <description>82% of orgs have MTTR over an hour and noise is the reason. Kill threshold alerts, switch to SLO burn-rate alerting, and let on-call sleep again.</description>
      <pubDate>Thu, 15 Jan 2026 00:00:00 GMT</pubDate>
    </item>
    <item>
      <title>Series B infrastructure is held together with duct tape</title>
      <link>https://ynden.tech/blog/series-b-infrastructure-duct-tape/</link>
      <guid>https://ynden.tech/blog/series-b-infrastructure-duct-tape/</guid>
      <description>You scaled to 23 microservices and nobody can draw how they connect. The fix for Series B observability isn&apos;t a rewrite, it&apos;s distributed tracing.</description>
      <pubDate>Mon, 12 Jan 2026 00:00:00 GMT</pubDate>
    </item>
    <item>
      <title>eBPF isn&apos;t magic (but it&apos;s close)</title>
      <link>https://ynden.tech/blog/ebpf-isnt-magic/</link>
      <guid>https://ynden.tech/blog/ebpf-isnt-magic/</guid>
      <description>Two years of eBPF observability in prod: the ARM64 breakage, the BPFDoor security risk, and when traditional OTel instrumentation just wins.</description>
      <pubDate>Sat, 10 Jan 2026 00:00:00 GMT</pubDate>
    </item>
    <item>
      <title>The board doesn&apos;t care about your uptime percentage</title>
      <link>https://ynden.tech/blog/board-doesnt-care-about-uptime/</link>
      <guid>https://ynden.tech/blog/board-doesnt-care-about-uptime/</guid>
      <description>Your board hears nines, they think dollars. Learn to report downtime as lost revenue so you actually get observability budget approved.</description>
      <pubDate>Thu, 08 Jan 2026 00:00:00 GMT</pubDate>
    </item>
    <item>
      <title>Why your MTTR got worse in 2026</title>
      <link>https://ynden.tech/blog/why-your-mttr-got-worse/</link>
      <guid>https://ynden.tech/blog/why-your-mttr-got-worse/</guid>
      <description>82% of companies now take over an hour to resolve incidents. More tools made you slower. The fix isn&apos;t adding a tool, it&apos;s removing three.</description>
      <pubDate>Mon, 05 Jan 2026 00:00:00 GMT</pubDate>
    </item>
    <item>
      <title>Your SOC2 audit is going to find your observability gaps</title>
      <link>https://ynden.tech/blog/soc2-audit-observability-gaps/</link>
      <guid>https://ynden.tech/blog/soc2-audit-observability-gaps/</guid>
      <description>SOC2 auditors don&apos;t care about dashboards. They want proof you&apos;d detect a data breach, an audit trail 6 months back, and an MTTR you can explain.</description>
      <pubDate>Fri, 02 Jan 2026 00:00:00 GMT</pubDate>
    </item>
    <item>
      <title>What I learned managing 50+ microservices</title>
      <link>https://ynden.tech/blog/managing-50-microservices/</link>
      <guid>https://ynden.tech/blog/managing-50-microservices/</guid>
      <description>Trace context dies at Kafka, everyone names attributes differently, and the OTel Collector silently drops spans. Hard lessons from 50+ microservices.</description>
      <pubDate>Sun, 28 Dec 2025 00:00:00 GMT</pubDate>
    </item>
    <item>
      <title>Why your dashboards lie to you</title>
      <link>https://ynden.tech/blog/why-your-dashboards-lie/</link>
      <guid>https://ynden.tech/blog/why-your-dashboards-lie/</guid>
      <description>Your payment service can return 200 OK with a failed body. All green dashboards, 40 support tickets. Instrument business outcomes, not HTTP calls.</description>
      <pubDate>Sat, 20 Dec 2025 00:00:00 GMT</pubDate>
    </item>
  </channel>
</rss>
