Amazon Web Services is operating normally again after a broad disruption that took down popular apps, websites, and connected devices across the internet for several hours. The issue began in Amazon’s US-East-1 region in Northern Virginia, the AWS hub most heavily used by its customers, and spread to key services before engineers were able to stabilize the platform.
Why the outage spread across core AWS services
AWS first reported elevated error rates and latency across a range of services, including Elastic Compute Cloud (EC2), Lambda, and DynamoDB, before narrowing the issue down to a Domain Name System problem affecting the DynamoDB API endpoint. When DNS breaks in the control plane, the dependency chain amplifies the pain: microservices can’t resolve endpoints, retries climb, and back ends begin to throttle under the unusual load.
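To make that failure mode concrete, here is a minimal Python sketch of what “can’t resolve the endpoint” looks like from an application’s point of view. The hostname is hypothetical and simply stands in for any service endpoint whose DNS record has stopped resolving.

```python
import socket

# Hypothetical endpoint name; during an incident like this, lookups fail
# even though the service behind the name is still running.
ENDPOINT = "dynamodb.us-east-1.internal.example.com"

def resolve(host: str, port: int = 443) -> str:
    """Resolve a hostname to an address, surfacing DNS failure as an app-level error."""
    try:
        return socket.getaddrinfo(host, port)[0][4][0]
    except socket.gaierror as exc:
        # This is the failure mode described above: the back end is healthy,
        # but callers cannot find it, so they retry and pile on extra load.
        raise RuntimeError(f"cannot resolve {host}: {exc}") from exc
```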

Once the initial DNS failure was resolved, a second failure mode emerged: Network Load Balancer health checks began to misbehave, returning false negatives and triggering failovers that pulled otherwise healthy resources out of rotation. That set off a second wave of timeouts and connection drops. At the height of the outage, dozens of services were affected, a textbook example of how cloud control-plane blips can spiral into customer-visible outages.
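The false negatives come down to health check sensitivity: how many failed probes should it take before a target is pulled from rotation? Below is a rough, generic sketch of the usual guard, a consecutive-failure threshold; it assumes nothing about how AWS implements its own Network Load Balancer checks.

```python
class HealthTracker:
    """Track probe results for one backend target and mark it unhealthy only
    after several consecutive failures, so a brief control-plane blip does
    not eject an otherwise healthy node from rotation."""

    def __init__(self, unhealthy_threshold: int = 3) -> None:
        self.unhealthy_threshold = unhealthy_threshold
        self.consecutive_failures = 0
        self.healthy = True

    def record_probe(self, probe_ok: bool) -> bool:
        """Record one probe result and return the current health verdict."""
        if probe_ok:
            self.consecutive_failures = 0
            self.healthy = True
        else:
            self.consecutive_failures += 1
            if self.consecutive_failures >= self.unhealthy_threshold:
                self.healthy = False
        return self.healthy
```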
Engineers pursued multiple recovery paths in parallel to bring impacted Availability Zones back without re-creating the failure: clearing the DNS condition, stabilizing load balancer health checks, and relaxing throttles on dependent services. In the most stubborn cases, customers were advised to flush DNS caches, a sign that stale resolution data was still lingering at the edge.
How widespread the damage was across apps and services
The scale was unmistakable. Consumer-facing platforms such as Snapchat, Ring, Alexa, Roblox, and Hulu saw outages. Financial and crypto services such as Coinbase and Robinhood said they were experiencing problems. Even Amazon’s own retail site and Prime Video experienced partial degradation. In Europe, outages affected big banks — users named Lloyds Banking Group as one of the institutions they were having trouble with — and some public-sector portals.
The outage-tracking firm Downdetector, which is owned by Ookla, registered millions of user reports around the world, a measure not only of how many services lean on AWS in their back ends but of how sharply even a relatively brief disruption is felt by companies cut off from critical information or systems. In the background, jobs stalled, authentication tokens expired, and edge caches went cold, not just at headline platforms but across thousands of smaller websites and APIs that hiccupped.
The home was not spared. Smart doorbells, cameras, and voice assistants that rely on AWS for remote processing lost their connections or were sluggish to respond, another reminder of how far cloud centralization reaches into living rooms and security systems. For some, it began with a silent doorbell or a voice assistant that wouldn’t respond to commands.
Inside the recovery playbook AWS used to restore service
AWS said it rolled out fixes incrementally, validating changes in a single Availability Zone before moving on to the others. It’s a standard best practice to prevent compounding failures while an incident is live. Lambda, which depends on several internal subsystems, was a focus because teams kept seeing invocation errors after the DNS symptoms cleared. EC2 instance launches also experienced latency as teams rebalanced control-plane systems.
To avoid recurrences, engineering teams typically tune DNS time-to-live values, recalibrate health check sensitivity, and adjust retry/backoff policies so that cascading retries don’t effectively DDoS downstream services. Look for Amazon’s pending postmortem to include details on control-plane isolation enhancements and safeguards around load balancer health logic.
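For the retry/backoff piece, the usual pattern is capped exponential backoff with jitter, so that thousands of clients recovering from the same incident do not retry in lockstep. A minimal sketch; the operation and the tuning values are placeholders, not anything Amazon has published.

```python
import random
import time

def call_with_backoff(op, max_attempts: int = 5, base: float = 0.2, cap: float = 5.0):
    """Retry op() with capped exponential backoff and full jitter.

    Spreading retries over a randomized window keeps a crowd of recovering
    clients from hammering an already-struggling dependency in unison.
    """
    for attempt in range(max_attempts):
        try:
            return op()
        except Exception:  # in real code, catch only retryable errors
            if attempt == max_attempts - 1:
                raise
            time.sleep(random.uniform(0, min(cap, base * (2 ** attempt))))
```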

What community leaders say about cloud resilience and risk
It was a core cloud incident, not an isolated app bug, analysts at Ookla said, pointing to the synchronized failure pattern across hundreds of apps.
And as Daniel Ramirez of Downdetector has previously noted, while large, fundamental outages are still the exception rather than the rule, they’ll become more noticeable as an ever-greater number of businesses converge onto a single cloud architecture.
“As we can see, the longer the chain of dependencies globally, the higher the risk that something will fail,” said Marijus Briedis, CTO at NordVPN. “If major world services are connected to a single system, there is always a risk that such interconnection will stop working in one region and affect many companies.”
And the countermeasures, while well understood, are unevenly distributed (a minimal failover sketch follows the list):
- Active-active multi-region setups
- Cross-region failover drills
- Vendor-agnostic DNS configurations
- Explicit dependency graphs to avoid masked single points of failure
And for teams running on AWS, the reliability pillar of the Well-Architected Framework is still your rulebook:
- Distribute workloads
- Decouple tightly bound services
- Apply conservative timeouts and jittered retries
- Run chaos drills that include DNS and load balancer failure scenarios (see the sketch after this list)
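A chaos drill for the DNS scenario can be as small as a test that forces every lookup to fail and asserts that the application degrades gracefully. The sketch below uses only Python’s standard library; falling back to a cached address is just one possible degradation strategy, not a claim about how any particular service handles it.

```python
import socket
from unittest import mock

def resolve_or_fallback(host: str, cached_ip: str) -> str:
    """Resolve a hostname, falling back to a last-known-good address when DNS is down."""
    try:
        return socket.getaddrinfo(host, 443)[0][4][0]
    except socket.gaierror:
        return cached_ip

def test_survives_dns_outage() -> None:
    # Simulate the incident's failure mode: every DNS lookup errors out.
    with mock.patch("socket.getaddrinfo", side_effect=socket.gaierror("simulated outage")):
        assert resolve_or_fallback("api.example.com", "203.0.113.10") == "203.0.113.10"

if __name__ == "__main__":
    test_survives_dns_outage()
    print("DNS-outage drill passed")
```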
Customers best able to weather the storm were those already running multi-region or multi-cloud for their critical paths.
What comes next as AWS reviews root causes and fixes
AWS says the underlying issues have been fully resolved and that residual throttling was cleared as systems returned to steady state. What comes next is a formal post-incident review, which should lay out the chain of root causes, the blast radius, and the corrective actions. Enterprises, for their part, will be examining DNS resolution behavior, load balancer health check thresholds, and control-plane isolation.
The outage has passed, but the lesson remains: The internet is only as resilient as its shared roots. When something fails in a key region like US-East-1, the effects are felt worldwide. Building for that reality, assuming certain dependencies will fail and designing in graceful degradation, is the only long-term strategy.