Cloudflare has revealed the cause of a major outage that affected portions of the internet in July, confirming that an oversized, automatically generated configuration file pushed by its deployment systems left some customers unable to reach services running behind its network.
The file swelled far beyond its intended size, crashed a crucial traffic-handling system, and temporarily degraded multiple services. The company said there is no evidence of an attack and that a full postmortem will be published.

What Cloudflare says happened during the July network outage
Cloudflare engineers noticed an uptick in traffic to one of its services, the company said. In response, an automated process generated a configuration file used to route and manage potential threats. That file, which was supposed to be limited in size, instead ballooned with far more entries than expected. When the system that manages that configuration pushed the changes, a core service effectively shut down as it attempted to process the new rule config, causing some traffic on Cloudflare's network to be blocked.
The outage window started at around 11:20 UTC and was fully over by 14:30 UTC, as Cloudflare first restored dashboards and then a broader set of application features. The company warned that users might experience short periods of slow service as traffic recovered, but said services should return to normal. A more comprehensive engineering breakdown will be published on the company's blog once the investigation concludes.
How an oversized config file disrupted global traffic
Automatic rule generation is standard practice for large edge networks: telemetry identifies dangerous patterns and appends them to a ruleset that is compiled and pushed worldwide. The failure mode here was a classic one at hyperscale: unbounded growth. When a configuration expands far beyond expectations, downstream software can run out of memory, hit parsing limits, or fall into crash loops on reload. Even a 5% reduction in performance headroom can push busy clusters into a chain of rolling restarts when demand peaks.
Guardrails in resilient systems range from hard caps on file size and schema validation that rejects pathological inputs, to staged rollouts that expose a small slice of the edge before going to 100%, and kill switches that route traffic back onto a known-good path (anecdotally, you are nowhere near production-ready until you have flipped a kill switch on and then off for a change under load). Cloudflare's acknowledgment indicates that at least one of those safety nets didn't trigger in time to stop an outsized configuration from reaching machines on the hot path of traffic.
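To make the first two guardrails concrete, here is a minimal Go sketch of a pre-deployment check that refuses an auto-generated ruleset if it exceeds a hard entry cap or fails basic schema validation. The Rule type, the maxRules limit, and the allowed actions are illustrative assumptions, not Cloudflare's actual schema.

```go
// Hypothetical pre-deployment guard: reject an auto-generated ruleset
// that exceeds a hard size cap or fails basic schema checks before it
// is ever pushed to the edge. Names and limits are illustrative.
package main

import (
	"errors"
	"fmt"
)

type Rule struct {
	ID      string
	Pattern string
	Action  string // "block", "challenge", or "log"
}

const maxRules = 10_000 // hard cap: generator output beyond this is refused

func validateRuleset(rules []Rule) error {
	if len(rules) > maxRules {
		return fmt.Errorf("ruleset has %d entries, cap is %d: refusing to deploy", len(rules), maxRules)
	}
	for _, r := range rules {
		if r.ID == "" || r.Pattern == "" {
			return errors.New("rule missing id or pattern")
		}
		switch r.Action {
		case "block", "challenge", "log":
		default:
			return fmt.Errorf("rule %s has unknown action %q", r.ID, r.Action)
		}
	}
	return nil
}

func main() {
	// An oversized, auto-generated ruleset is caught here instead of
	// crashing the traffic-handling process that would load it.
	oversized := make([]Rule, 50_000)
	for i := range oversized {
		oversized[i] = Rule{ID: fmt.Sprintf("r%d", i), Pattern: ".*", Action: "block"}
	}
	if err := validateRuleset(oversized); err != nil {
		fmt.Println("deployment aborted:", err)
	}
}
```

The point is placement: the check runs in the deployment pipeline, before anything on the request path ever tries to load the file.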
Why the outage impact was so broad across the internet
Cloudflare sits in front of millions of websites and APIs, providing content delivery, security, DNS, and zero-trust access. When core traffic-processing components at the edge misbehave, the damage reaches end users very quickly: pages fail to load, API requests time out, and transient 5xx errors become widespread. The timing magnified the impact, as morning logins spiked across major regions around the world, creating sharp surges in traffic that coincided with the growing network disruption stemming from the configuration error.

The event is a reminder of other high-profile infrastructure incidents. Fastly's 2021 outage, caused by a dormant software bug triggered by a routine, otherwise innocuous configuration change, cascaded worldwide and lasted about 49 minutes. Akamai has also previously seen configuration-related issues interrupt edge and DNS services. The through line is easy to draw: in distributed networks operating at internet scale, one bad knob or code path can spread globally in seconds.
External tracking services such as W3Techs and BuiltWith regularly identify Cloudflare as the most commonly used reverse proxy or CDN among the world's most visited sites. That footprint means outsized downstream effects when something goes wrong, which makes strong change control and fast rollback a requirement.
What changes Cloudflare plans after the July outage
From the description, the obvious fixes are hard limits on the size of auto-generated configuration, stronger pre-deployment validation, canary rollouts that expose a new config to a fraction of the edge first, and clear abort conditions if a config starts driving up error rates. Teams frequently put compile-time checks in place that reject configurations beyond certain limits, and separate control-plane updates from data-plane processing so a bad update cannot crash the request path.
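As a rough illustration of the canary-plus-abort pattern, the Go sketch below pushes a new config to a small canary slice first and rolls back if the observed error rate crosses a threshold. The targets, the telemetry function, and the 2% threshold are hypothetical stand-ins, not Cloudflare's actual rollout tooling.

```go
// Sketch of a staged rollout with an abort condition: deploy the new
// config to a canary slice, watch the error rate, and stop before the
// rest of the fleet is touched if it crosses a threshold.
package main

import "fmt"

type deployTarget struct {
	name string
}

// errorRateAfterDeploy stands in for real telemetry gathered from the
// canary hosts after the new config is applied.
func errorRateAfterDeploy(t deployTarget) float64 {
	if t.name == "canary" {
		return 0.12 // 12% of requests failing: well above a sane threshold
	}
	return 0.001
}

func main() {
	const maxErrorRate = 0.02 // abort if more than 2% of canary requests fail

	canary := deployTarget{name: "canary"}
	fleet := []deployTarget{{name: "region-a"}, {name: "region-b"}, {name: "region-c"}}

	fmt.Println("deploying new config to", canary.name)
	if rate := errorRateAfterDeploy(canary); rate > maxErrorRate {
		fmt.Printf("canary error rate %.1f%% exceeds %.1f%%: rolling back, fleet untouched\n",
			rate*100, maxErrorRate*100)
		return
	}
	for _, t := range fleet {
		fmt.Println("deploying new config to", t.name)
	}
}
```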
Cloudflare has said it regrets the disruption and that it intends to learn from the incident. The upcoming postmortem ought to explain the specific code path that failed, what monitoring detected the ballooning ruleset, and what guardrails are being added on top of the existing ones. Customers will be looking for firm dates and clearly measurable service-level targets.
What users and teams can do now to build resilience
For businesses running on edge networks, best practices include closely watching your provider's status page, setting up synthetic checks outside of your primary CDN, and being deliberate about DNS TTLs so you have time to fail over in good order. Multi-CDN and regional traffic steering reduce single-vendor risk where possible, but at the cost of added complexity and expense.
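A synthetic check can be as simple as an HTTP probe run from a vantage point outside your primary CDN. The Go sketch below assumes a hypothetical /healthz endpoint and a five-second timeout; the failure signal would feed whatever alerting or failover automation you already use.

```go
// Minimal external synthetic check: probe a health URL served through
// the CDN and report failures so failover can happen deliberately.
// The target URL and timeout are placeholders.
package main

import (
	"fmt"
	"net/http"
	"time"
)

func probe(url string) error {
	client := &http.Client{Timeout: 5 * time.Second}
	resp, err := client.Get(url)
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	if resp.StatusCode >= 500 {
		return fmt.Errorf("unhealthy status %d", resp.StatusCode)
	}
	return nil
}

func main() {
	// Hypothetical health endpoint fronted by the CDN under test.
	const target = "https://www.example.com/healthz"
	if err := probe(target); err != nil {
		fmt.Println("synthetic check failed, consider failing over:", err)
		return
	}
	fmt.Println("synthetic check passed")
}
```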
For most end users, there is little to do beyond waiting for recovery, and that work is already done. The lesson is not one of alarm so much as design: on today's internet, resilience is as much about the discipline of configuration as it is about lines of code.
