An Amazon Web Services failure spread across the web, knocking popular sites, apps and smart home systems offline for several hours. The disruption was centered on the US-East-1 region in Northern Virginia, a gigantic hub that many companies treat as their default. The incident was a clear reminder that a single fault inside one major cloud can ripple across the internet.
What failed inside AWS and how the issue cascaded
AWS reported increased error rates and latency for core services such as EC2, Lambda and DynamoDB, and identified the root cause as a DNS resolution issue affecting a DynamoDB API endpoint in US-East-1. When DNS breaks, services can no longer translate hostnames into usable IP addresses, and every dependent system starts throttling or retrying.
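To make that failure mode concrete, here is a minimal Python sketch of the resolution step that broke; the regional hostname below is used purely for illustration, and real AWS SDKs layer their own retry and endpoint logic on top of this.

```python
import socket

# DynamoDB's regional endpoint hostname, used here purely for illustration.
ENDPOINT = "dynamodb.us-east-1.amazonaws.com"

def resolve_endpoint(hostname: str, port: int = 443) -> list[str]:
    """Translate a hostname into IP addresses, the step that failed during the outage."""
    try:
        infos = socket.getaddrinfo(hostname, port, proto=socket.IPPROTO_TCP)
        return sorted({info[4][0] for info in infos})
    except socket.gaierror as exc:
        # When resolution fails, the error surfaces before any request is even sent,
        # so every dependent system starts erroring out or retrying at once.
        raise RuntimeError(f"DNS resolution failed for {hostname}: {exc}") from exc

if __name__ == "__main__":
    print(resolve_endpoint(ENDPOINT))
```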

That interplay mattered. A lot of apps are now running session state, configuration and feature flags on top of DynamoDB. As SDKs tried to reissue requests, the load mushroomed into a “retry storm,” overwhelming network gateways and control planes. The old saw proved true: it’s always DNS, until it’s not.
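One standard way to keep a transient failure from amplifying is to cap retries and add jitter so clients do not hammer a struggling dependency in lockstep. A minimal sketch in generic Python, not tied to any particular SDK:

```python
import random
import time

def call_with_backoff(operation, max_attempts: int = 4,
                      base_delay: float = 0.2, max_delay: float = 5.0):
    """Retry a flaky call with capped, jittered exponential backoff.

    Unbounded, synchronized retries are what turn a brief DNS blip into a retry
    storm; the attempt cap bounds the extra load and jitter spreads clients out.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except (ConnectionError, TimeoutError):
            if attempt == max_attempts:
                raise  # give up instead of adding to the storm
            delay = min(max_delay, base_delay * (2 ** attempt))
            time.sleep(random.uniform(0, delay))  # "full jitter" backoff
```

Most AWS SDKs expose similar knobs, such as a maximum attempt count and a retry mode; the essential point is that the retry budget has to be finite.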
Why a single region failure can disrupt many apps
US-East-1 is the busiest of AWS's regions, and frequently where new services arrive first. It also serves as an important management hub, so incidents there can have outsized impact. Even when reads are distributed worldwide, many applications keep critical write paths pinned to this region to meet latency and cost requirements.
Analysts at firms like Gartner have long cautioned about single-region and single-cloud concentration risk. If an app cannot fail over safely to another region because DNS, identity or the data layer is still tied to one location, a relatively small fault can cascade.
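As an illustration of what client-side failover can look like once the data layer is already replicated (for example via global tables), here is a hedged Python sketch using boto3 with a hypothetical region list and table name; real failover also has to reason about consistency and write conflicts:

```python
import boto3
from botocore.config import Config
from botocore.exceptions import BotoCoreError, ClientError

REGIONS = ["us-east-1", "us-west-2"]   # primary first, fallback second (illustrative)
TABLE = "sessions"                     # hypothetical table replicated to both regions

def get_item_with_failover(key: dict) -> dict | None:
    """Try the primary region, then fall back to the secondary if calls fail outright."""
    last_error = None
    for region in REGIONS:
        client = boto3.client(
            "dynamodb",
            region_name=region,
            config=Config(connect_timeout=1, read_timeout=2,
                          retries={"max_attempts": 2}),  # fail fast, don't pile on
        )
        try:
            return client.get_item(TableName=TABLE, Key=key).get("Item")
        except (BotoCoreError, ClientError) as exc:
            last_error = exc  # remember the failure and try the next region
    raise RuntimeError("all configured regions failed") from last_error

# Example call: get_item_with_failover({"session_id": {"S": "abc123"}})
```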
How the outage rippled across sites and services
The outage had a visible impact on both consumers and businesses. Users reported trouble logging into services such as Snapchat, Ring, Alexa, Roblox and Hulu, as well as financial and AI services like Coinbase, Robinhood and Perplexity. Even some of Amazon's own retail and streaming properties saw initial disruption.
There were also reports of degraded service at major institutions outside the United States, underscoring the global reach of US-East-1 dependencies. Smart home devices lost connectivity, back-office systems ground to a halt, and customer support queues swelled as apps waited to time out instead of failing fast.
What the data shows about the scope of the outage
The AWS Health Dashboard logged impact across 28 services during the event, an indication that the blast radius was wider than a single product. Outage-tracking sites like Downdetector received more than 14,000 user reports for Amazon at the outage's peak, and infrastructure monitoring companies observed spikes in DNS error rates and connection timeouts across North America and Europe.

AWS said it pursued several recovery paths in parallel. After fixing the DNS issue, the company recommended that some customers flush their DNS caches to clear stale records and allow fresh resolution. While most operations were restored promptly, some services remained throttled while capacity stabilized.
How a DNS glitch can lead to a widespread outage
DNS issues are uniquely pernicious. Endpoints with small TTLs generate high lookup rates; if resolvers return errors, clients back off and retry, producing a surge of traffic. In microservice architectures, a single failed dependency, such as a metadata store or session store, can effectively lock up login flows, checkouts and content delivery even when compute and networking are otherwise healthy.
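A back-of-the-envelope calculation shows why a short TTL plus error-driven retries multiplies load on resolvers; the numbers below are hypothetical, chosen only to illustrate the scaling:

```python
# Hypothetical fleet, not measurements from the incident.
clients = 1_000_000            # processes that each re-resolve the endpoint
ttl_seconds = 5                # short TTL on the endpoint's DNS record
retry_interval_seconds = 1     # how often clients re-ask once answers turn into errors

healthy_qps = clients / ttl_seconds             # ~200,000 lookups/second when healthy
failing_qps = clients / retry_interval_seconds  # ~1,000,000 lookups/second during errors

print(f"healthy: {healthy_qps:,.0f} qps, failing: {failing_qps:,.0f} qps")
```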
In cloud environments, private DNS resolvers, service discovery and control-plane APIs all sit in the same dependency tree. A bug in one layer can surface as many different failures at the edge of an application, making diagnosis harder and recovery slower without protective patterns.
Lessons for engineering teams to improve resilience
Treat regional failure as a first-class design concern. Active-active deployments across regions, DynamoDB global tables, or equivalent multi-region data replication can keep core user flows alive when a dependency stalls, or at least degrade noncritical features to read-only mode. Use circuit breakers, bulkheads and request budgets to avoid retry storms.
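As one example of those protective patterns, here is a minimal circuit-breaker sketch in Python; production implementations (or library equivalents) add half-open probing policies, per-dependency tuning and metrics:

```python
import time

class CircuitBreaker:
    """Stop calling a failing dependency for a cool-off period instead of retrying forever."""

    def __init__(self, failure_threshold: int = 5, reset_after: float = 30.0):
        self.failure_threshold = failure_threshold
        self.reset_after = reset_after
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed (calls allowed)

    def call(self, operation):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_after:
                raise RuntimeError("circuit open: failing fast instead of piling on retries")
            # Cool-off elapsed: allow a trial call through ("half-open").
            self.opened_at = None
            self.failures = 0
        try:
            result = operation()
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = time.monotonic()
            raise
        self.failures = 0
        return result
```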
Harden DNS: use multiple resolvers, validate health checks, and pre-provision failover endpoints so traffic can switch over quickly. Cache configuration and identity tokens prudently so applications can tolerate transient control-plane failures. Chaos engineering exercises and game days that simulate DNS or data-store failures are important tools for shrinking the blast radius.
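A hedged sketch of what resolver fallback plus a serve-stale cache might look like, assuming the third-party dnspython package and illustrative resolver addresses:

```python
import time
import dns.resolver  # third-party "dnspython" package, assumed to be installed

# Illustrative resolver sets: try the private/VPC resolver first, then public ones.
RESOLVER_SETS = [["10.0.0.2"], ["1.1.1.1", "8.8.8.8"]]
_last_good: dict[str, tuple[list[str], float]] = {}

def resolve_hardened(hostname: str, stale_ttl: float = 300.0) -> list[str]:
    """Try each resolver set in turn; on total failure, serve a recent last-good answer."""
    for nameservers in RESOLVER_SETS:
        resolver = dns.resolver.Resolver(configure=False)
        resolver.nameservers = nameservers
        resolver.lifetime = 2.0  # fail fast rather than hanging on a sick resolver
        try:
            answer = resolver.resolve(hostname, "A")
            ips = [rr.to_text() for rr in answer]
            _last_good[hostname] = (ips, time.monotonic())
            return ips
        except Exception:
            continue  # move on to the next resolver set
    ips, cached_at = _last_good.get(hostname, ([], 0.0))
    if ips and time.monotonic() - cached_at < stale_ttl:
        return ips  # degrade gracefully on a stale but recent answer
    raise RuntimeError(f"unable to resolve {hostname} via any configured resolver")
```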
The bigger picture and long-term resilience takeaways
Cloud providers have strong aggregate uptime, but dependency chain brittleness persists. Uptime Institute research has consistently indicated that software glitches and change management anomalies are the two biggest culprits in major incidents, many of which have enormous financial implications.
This outage didn’t mean that “the internet” failed; it revealed a concentration risk. A single DNS ripple connected to a key data service in a dominant region was sufficient to darken substantial swaths of the online economy. The takeaway is clear: resilience should not be a feature you bolt on after the fact; rather, it’s an architectural decision you make up front.