Cloudflare, one of the largest internet security companies, posted a detailed incident report revealing that it briefly took itself offline: a bug in its own dashboard flooded an internal authorization service with requests.

How a dashboard bug triggered a self‑DDoS at Cloudflare
Client-side changes in the Cloudflare dashboard introduced a “problematic object in its dependency array,” which caused a component to re-run repeatedly, hitting the Tenant Service API over and over, according to the company’s engineering blog. What should have been one authorization check per view became a loop of redundant requests, multiplying traffic and effectively DDoS’ing the service from inside the house.
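The underlying failure mode is a well-known React pitfall. Here is a minimal, hypothetical sketch (the hook, field names and endpoint are invented, not Cloudflare’s actual code) of how an object literal in a useEffect dependency array can turn one fetch per view into a loop of fetches:

```ts
// Hypothetical sketch of the anti-pattern described above.
// `query` is a brand-new object on every render, so it never compares equal
// in the dependency array; the effect re-runs, updates state, triggers another
// render, and the loop hammers the authorization endpoint.
import { useEffect, useState } from "react";

function useTenantPermissions(accountId: string) {
  const [permissions, setPermissions] = useState<string[]>([]);

  // BUG: fresh object reference each render defeats the dependency check.
  const query = { accountId, scope: "dashboard" };

  useEffect(() => {
    fetch("/api/v4/tenant/permissions", {
      method: "POST",
      headers: { "Content-Type": "application/json" },
      body: JSON.stringify(query),
    })
      .then((res) => res.json())
      .then((data) => setPermissions(data.permissions)); // state update -> re-render -> loop
  }, [query]); // FIX: depend on stable primitives instead, e.g. [accountId]

  return permissions;
}
```

The fix is equally mundane: depend on stable primitive values (or memoize the object), so the effect runs once per view rather than once per render.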
The traffic spike was not malicious, engineers Tom Lianza and Joaquin Madruga said. It was an unintentional retry storm born of UI logic, the sort of subtle regression that slips past tests and then avalanches in production at scale. Every unnecessary call added load; multiplied across users, it grew to tens of thousands of requests and overloaded an important control plane.
Why the service cascade was so disruptive across APIs
The Tenant Service sits in Cloudflare’s API authorization path. When it slowed down, authorization checks could not complete, causing 5xx errors across various APIs and taking parts of the dashboard offline. It’s a textbook cascade: a shared dependency becomes a bottleneck and grinds to a halt, and the slowdown spreads to other, otherwise unrelated endpoints.
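To see why a slow shared dependency shows up as server errors on endpoints that have nothing to do with it, consider this hypothetical sketch (service URL, handler shape, and timeout budget are all assumptions, not Cloudflare’s implementation): every route waits on the same authorization check, so when that check times out, every route fails together.

```ts
// Hypothetical sketch: all API routes funnel through one authorization dependency.
async function authorizeRequest(token: string): Promise<boolean> {
  const controller = new AbortController();
  const timeout = setTimeout(() => controller.abort(), 2_000); // assumed 2s budget

  try {
    const res = await fetch("https://auth.internal.example/check", {
      method: "POST",
      headers: { Authorization: `Bearer ${token}` },
      signal: controller.signal,
    });
    return res.ok;
  } finally {
    clearTimeout(timeout);
  }
}

// If the shared auth path is overloaded, DNS, firewall, and Workers handlers
// all start returning 5xx at once, even though their own logic is healthy.
async function handle(req: { token: string }, routeLogic: () => Promise<Response>) {
  try {
    if (!(await authorizeRequest(req.token))) {
      return new Response("Forbidden", { status: 403 });
    }
    return await routeLogic();
  } catch {
    // Timeouts and errors from the shared dependency surface here as server errors.
    return new Response("Authorization backend unavailable", { status: 503 });
  }
}
```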
That is the hazard of tightly coupled microservices: where guardrails are thin, a single noisy neighbor can take down the whole neighborhood, especially when retries are unbounded and clients treat failure as an invitation to try again immediately.
What customers likely noticed during the Cloudflare outage
Users started complaining about dashboard failures, and API calls responded with server errors. Because the outage affected internal authorization rather than the global edge, it didn’t resemble the widespread application downtime of previous industry outages. Still, for developers managing DNS, firewall rules or Workers scripts through the dashboard or API, it was a real disruption.

It’s a reminder that reliability is not just about keeping traffic moving at the edge; it’s also about defending the control plane that configures and authorizes that traffic.
Lessons for API resilience and defending control planes
There are some familiar anti-patterns embedded in this failure. Client-driven storms call for server-side rate limiting and backpressure: token buckets to bound per-tenant or per-IP calls, sliding windows to smooth bursts, and “shed load” responses issued before services are brought to their knees. On the client side, bounded exponential backoff with jitter, as recommended in Google’s Site Reliability Engineering guide, keeps retries from synchronizing into a single destructive pulse (both are sketched below).
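Here is a compact sketch of those two guardrails; the capacities, refill rates, and delay caps are illustrative assumptions, not values from the incident.

```ts
// Server side: a per-tenant token bucket that admits or sheds requests.
class TokenBucket {
  private tokens: number;
  private lastRefill = Date.now();

  constructor(private capacity: number, private refillPerSec: number) {
    this.tokens = capacity;
  }

  tryConsume(): boolean {
    const now = Date.now();
    // Refill proportionally to elapsed time, capped at capacity.
    this.tokens = Math.min(
      this.capacity,
      this.tokens + ((now - this.lastRefill) / 1000) * this.refillPerSec
    );
    this.lastRefill = now;
    if (this.tokens >= 1) {
      this.tokens -= 1;
      return true; // admit the request
    }
    return false; // shed load (e.g. respond 429) before the backend falls over
  }
}

// Client side: bounded exponential backoff with full jitter, so retrying
// clients spread out instead of hammering the service in lockstep.
async function fetchWithBackoff(url: string, maxAttempts = 5): Promise<Response> {
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    const res = await fetch(url);
    if (res.status !== 429 && res.status < 500) return res;
    const cap = 30_000; // assumed 30s ceiling
    const base = 200;   // assumed 200ms base delay
    const delay = Math.random() * Math.min(cap, base * 2 ** attempt);
    await new Promise((resolve) => setTimeout(resolve, delay));
  }
  throw new Error(`gave up after ${maxAttempts} attempts`);
}
```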
On the platform side, circuit breakers and bulkheads keep an authorization service from becoming a systemic single point of failure. Short-TTL caches on non-sensitive attributes can trim redundant calls; idempotency keys can defuse duplicate work; and feature flags with safe rollback let teams back out a bad change faster than a full deploy cycle. Chaos testing, popularized by Netflix’s engineering practice, can expose where retry storms and dependency loops will bite before production does.
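A circuit breaker is the simplest of these to picture. This is a minimal sketch with assumed thresholds and states, not any particular library’s API: after enough consecutive failures the breaker opens and calls fail fast, giving the overloaded dependency room to recover; after a cooldown it lets one probe through before closing again.

```ts
type BreakerState = "closed" | "open" | "half-open";

class CircuitBreaker {
  private state: BreakerState = "closed";
  private failures = 0;
  private openedAt = 0;

  constructor(private failureThreshold = 5, private cooldownMs = 10_000) {}

  async call<T>(fn: () => Promise<T>): Promise<T> {
    if (this.state === "open") {
      if (Date.now() - this.openedAt < this.cooldownMs) {
        throw new Error("circuit open: failing fast"); // shed work immediately
      }
      this.state = "half-open"; // allow a single probe request through
    }
    try {
      const result = await fn();
      this.failures = 0;
      this.state = "closed"; // probe (or normal call) succeeded
      return result;
    } catch (err) {
      this.failures += 1;
      if (this.state === "half-open" || this.failures >= this.failureThreshold) {
        this.state = "open";
        this.openedAt = Date.now();
      }
      throw err;
    }
  }
}
```

Wrapping calls to a shared dependency in a breaker like this turns a slow, failing service into fast, explicit failures that downstream code can handle, instead of a pile-up of waiting requests.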
Not even DDoS experts are safe from self‑inflicted outages
The irony stings because Cloudflare regularly publishes reports detailing record-breaking DDoS waves and how its network blocks gargantuan application-layer floods.
But history shows that the self-inflicted outage is an equal-opportunity hazard. A single configuration flag led to a highly publicized 2021 CDN outage for another provider, and in the same year a routing change blacked out a global social network for hours. Complexity is undefeated.
Cloudflare said that systems were returning to normal and apologized for the disruption. The lesson is not that someone got something wrong; it’s about the system itself. When a dashboard running on a platform can inadvertently bring down that platform, your resilience and reliability patterns have to assume that any client, even one of your own, may misbehave. Build for that, and a self‑DDoS becomes a test case instead of tomorrow’s headline.