Two of the biggest outages of 2025 had nothing to do with load.

Not a traffic spike. Not a DDoS. Not a capacity problem. Both were self-inflicted, by data that was accepted as valid and a system that trusted it too much.

If you run infrastructure at Series B-D scale, these two postmortems are the cheapest reliability lessons you’ll get all year. So steal from them.

Cloudflare, November 18

A routine permissions change on a ClickHouse cluster caused a query to start returning duplicate rows. That query generated Cloudflare’s Bot Management feature file, the config their proxy loads to score traffic.

Duplicate rows meant the file doubled in size. The file crossed a hardcoded size limit. The proxy hit that limit and panicked.

A large share of the HTTP traffic crossing Cloudflare’s network started failing at 11:20 UTC, and service was not fully restored until 17:06 UTC.

Read that chain again. A database permissions change took down the edge proxy. Nobody wrote bad code. The query was fine yesterday. The data changed shape, and a latent size limit nobody remembered turned a config update into a crash.

AWS US-EAST-1, October 20

Different failure, same family. An automation race produced an empty DNS record, and the resulting resolution failure cascaded through DynamoDB’s dependency chain. US-EAST-1 was degraded for about 15 hours. Downdetector logged over 17 million reports. Half the internet noticed.

Again, not load. A control-plane automation produced an invalid-but-accepted state, and everything downstream that depended on it fell over.

Config is data, and you don’t test it like data

Here’s the part that should bother you. You probably test your code. You have CI, unit tests, staging.

Your config? Your feature flags, your generated rule files, your DNS records? Those ship straight to prod with none of that. They’re “just data.” Until the data is wrong and the data is load-bearing.

Both 2025 outages were latent limits and unvalidated config meeting reality at the worst possible time.

What to actually copy

Put hard limits on the things that generate config. If a file feeds production, bound its size, row count, and shape, and make the generation step fail loudly before the file ships, instead of letting the consumer fail at runtime. Cloudflare’s proxy found the limit the hard way. The pipeline should have found it first.
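
Here is roughly what that looks like at the generation step. This is a minimal sketch, not Cloudflare's pipeline: the Limits fields, the "name" key, and the render_feature_file function are all hypothetical names chosen for illustration.

```python
import json
from dataclasses import dataclass


class ConfigGenerationError(Exception):
    """Raised when a generated file breaks one of its declared bounds."""


@dataclass(frozen=True)
class Limits:
    # Hypothetical bounds; pick real values from what the consumer can handle.
    max_rows: int
    max_bytes: int
    required_fields: frozenset


def render_feature_file(rows, limits):
    """Serialize rows to bytes, or raise before anything gets published."""
    if len(rows) > limits.max_rows:
        raise ConfigGenerationError(
            f"{len(rows)} rows exceeds the limit of {limits.max_rows}"
        )

    seen = set()
    for row in rows:
        missing = limits.required_fields - set(row)
        if missing:
            raise ConfigGenerationError(f"row is missing fields: {sorted(missing)}")
        # Assumes "name" is the unique key and is listed in required_fields.
        if row["name"] in seen:
            raise ConfigGenerationError(f"duplicate row for {row['name']!r}")
        seen.add(row["name"])

    payload = json.dumps(rows, sort_keys=True).encode("utf-8")
    if len(payload) > limits.max_bytes:
        raise ConfigGenerationError(
            f"{len(payload)} bytes exceeds the limit of {limits.max_bytes}"
        )
    return payload
```

The point of the design is where the exception fires: in the job that builds the file, where a failure pages one team, and not in the process that serves traffic.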

Fail static, not open or closed. When a config is malformed or missing, the safe move is usually to keep serving the last known-good version, not to crash and not to fail open. Decide this on purpose for every config that loads at runtime.
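
A sketch of fail-static on the consumer side, assuming the config arrives as JSON. StaticConfigHolder, validate, and MAX_FEATURES are hypothetical names, and the limit value is made up for the example.

```python
import json

MAX_FEATURES = 4  # hypothetical bound, kept tiny for the example


def validate(cfg):
    """Raise if the candidate config is not safe to load."""
    if not isinstance(cfg, dict):
        raise TypeError("config must be a JSON object")
    if len(cfg["features"]) > MAX_FEATURES:
        raise ValueError("too many features")


class StaticConfigHolder:
    """Keeps serving the last config that passed validation."""

    def __init__(self, initial, validator):
        validator(initial)  # a bad bootstrap config should fail at startup
        self._current = initial
        self._validator = validator
        self.rejected = 0  # export this as a metric so rejections are visible

    @property
    def current(self):
        return self._current

    def try_update(self, raw):
        """Swap in a new config only if it parses and validates."""
        try:
            candidate = json.loads(raw)
            self._validator(candidate)
        except (ValueError, TypeError, KeyError):
            # Fail static: keep the old config, count the rejection, carry on.
            self.rejected += 1
            return False
        self._current = candidate
        return True
```

A rejected update still needs an alert. Serving stale config quietly for a week is its own incident, just a slower one.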

Validate config in CI like you validate code. Schema checks, size checks, “does this DNS record actually resolve” checks. Treat a config change as a deploy, because it is one.
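
A sketch of a CI check that covers all three, under assumptions: the config is a JSON object with an "upstreams" list of hostnames, and MAX_FILE_BYTES is a number you chose. Both the schema and the names are hypothetical. The resolver is injectable so the check can be tested without a network.

```python
import json
import socket

MAX_FILE_BYTES = 64 * 1024  # hypothetical size budget for this file


def default_resolves(hostname):
    """True if the hostname resolves to at least one address."""
    try:
        return len(socket.getaddrinfo(hostname, None)) > 0
    except socket.gaierror:
        return False


def check_config(raw, resolves=default_resolves):
    """Return a list of problems; an empty list means the change may ship."""
    problems = []
    if len(raw) > MAX_FILE_BYTES:
        problems.append(f"file is {len(raw)} bytes, budget is {MAX_FILE_BYTES}")

    try:
        cfg = json.loads(raw)
    except ValueError as exc:
        return problems + [f"not valid JSON: {exc}"]

    if not isinstance(cfg, dict):
        return problems + ["top level must be an object"]

    upstreams = cfg.get("upstreams")
    if not isinstance(upstreams, list) or not upstreams:
        return problems + ["upstreams must be a non-empty list"]

    for index, host in enumerate(upstreams):
        if not isinstance(host, str) or not host:
            problems.append(f"upstreams[{index}] is empty or not a string")
        elif not resolves(host):
            problems.append(f"upstreams[{index}] {host!r} does not resolve")
    return problems
```

In CI, you would read the changed file, call check_config, print whatever it returns, and exit non-zero if the list is not empty.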

Map your blast radius before the incident. Both outages cascaded through dependencies the owners didn’t fully picture. You should be able to answer “if this one thing returns garbage, what else falls over” on a whiteboard today, not at 3am.
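
The whiteboard version fits in a few lines once you have written the dependencies down. A minimal sketch, assuming you keep a map of each service to the things it depends on. The service names are hypothetical and the map is only as good as what you put in it.

```python
from collections import defaultdict, deque


def blast_radius(depends_on, failed):
    """Return every service that directly or transitively depends on `failed`."""
    # Invert the map: for each dependency, who relies on it?
    dependents = defaultdict(set)
    for service, deps in depends_on.items():
        for dep in deps:
            dependents[dep].add(service)

    affected = set()
    queue = deque([failed])
    while queue:
        node = queue.popleft()
        for service in dependents.get(node, ()):
            if service != failed and service not in affected:
                affected.add(service)
                queue.append(service)
    return affected


# Hypothetical dependency map for illustration.
DEPENDS_ON = {
    "checkout": ["sessions", "payments"],
    "sessions": ["kv-store"],
    "kv-store": ["regional-dns"],
    "payments": [],
    "marketing-site": [],
}

print(sorted(blast_radius(DEPENDS_ON, "regional-dns")))
```

With this map, a failure in regional-dns reaches kv-store, sessions, and checkout, while payments and marketing-site stay out of it. If the answer for one of your own dependencies surprises you, that is the finding.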

And watch for shape changes, not just outages. Your monitoring probably alerts on errors and latency. It almost certainly doesn’t alert on “this generated file is suddenly 2x its normal size” or “this query returned twice the usual rows.” Those are the leading indicators, and most teams have zero coverage on them.
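
A shape check can be this small. The sketch below compares the latest value of each metric against the median of recent runs and flags anything that grew or shrank past a ratio. The metric names, the 1.5x threshold, and the three-sample minimum are assumptions to tune, not recommendations.

```python
from statistics import median


def shape_alerts(history, current, max_ratio=1.5, min_samples=3):
    """Flag metrics whose latest value is far from their recent median."""
    alerts = []
    for metric, value in current.items():
        past = history.get(metric, [])
        if len(past) < min_samples:
            continue  # not enough baseline to judge
        baseline = median(past)
        if baseline <= 0:
            continue  # avoid dividing by zero on an empty baseline
        ratio = value / baseline
        # Alert on sudden growth and on sudden shrinkage (an empty file counts).
        if ratio > max_ratio or ratio < 1 / max_ratio:
            alerts.append(
                f"{metric}: {value} is {ratio:.1f}x the recent median of {baseline}"
            )
    return alerts
```

Feed it the size and row count of every generated file on every run. It checks both directions on purpose: a file that doubles and a record that comes back empty are the same kind of signal.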

The bigger point

Forrester is predicting at least two more multi-day hyperscaler outages in 2026. That’s a forecast, not a fact, but the direction is right. As everyone piles AI infrastructure on top of aging control planes, the blast radius keeps growing.

You can’t prevent US-EAST-1 from having a bad day. You can decide, in advance, how much of your business goes down with it. That’s a design choice you make now or a postmortem you write later.


The math on what an outage actually costs you (the war room, the lost revenue, the trust you don’t get back) is in my breakdown of reliability costs.

— Youn