Amazon Web Services says a widespread outage that knocked cloud services offline, from Adobe to GitHub to Slack, has been fixed. The company attributed the “high error rates” in its Northern Virginia region to DNS issues affecting DynamoDB API endpoints, and after several hours of elevated error rates across multiple AWS services, it said that “we’re operating normally across all our services and mitigation is complete.”
How DNS Failures Disrupted AWS Services and Clients
DNS is the internet’s phone book, resolving human-friendly names into IP addresses. If those lookups fail, or return stale results, services that depend on them cannot connect, even when the underlying compute and storage are healthy. AWS attributed the incident to DNS resolution for DynamoDB endpoints within us-east-1, a region so packed with production workloads that for many customers it effectively serves as an extension of their own environments. That combination was enough to ripple through dependent systems, causing timeouts and slow responses.
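To make the failure mode concrete, here is a minimal Python sketch of the lookup every client performs before it can talk to DynamoDB; the hostname is the standard regional endpoint for us-east-1, and the error handling is illustrative rather than drawn from AWS’s incident report.

```python
import socket

# Standard regional endpoint name for DynamoDB in us-east-1.
ENDPOINT = "dynamodb.us-east-1.amazonaws.com"

try:
    # getaddrinfo performs the DNS lookup that every SDK call depends on.
    records = socket.getaddrinfo(ENDPOINT, 443, proto=socket.IPPROTO_TCP)
    addresses = sorted({record[4][0] for record in records})
    print(f"{ENDPOINT} resolves to {addresses}")
except socket.gaierror as exc:
    # During the outage, lookups like this one failed, so requests never
    # reached DynamoDB even though the database itself was healthy.
    print(f"DNS resolution failed for {ENDPOINT}: {exc}")
```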
Engineers mitigated the DNS issue and then worked through lingering impact as caches expired and clients retried. That post-fix tail is common in DNS-related incidents: resolvers obey time-to-live (TTL) settings, so application recovery can lag until every layer of the stack, from enterprise resolvers up to client SDKs, has picked up fresh records.
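To see why recovery staggers, consider this hedged sketch of the kind of TTL-respecting cache a resolver or client keeps; the name, address, and 60-second TTL are made up for the example.

```python
import time

class TtlCache:
    """Minimal stand-in for a resolver cache that honors record TTLs."""

    def __init__(self):
        self._records = {}  # name -> (addresses, expiry timestamp)

    def put(self, name, addresses, ttl_seconds):
        self._records[name] = (addresses, time.monotonic() + ttl_seconds)

    def get(self, name):
        addresses, expires_at = self._records.get(name, (None, 0.0))
        if addresses is not None and time.monotonic() < expires_at:
            # Until the TTL runs out, callers keep getting the cached
            # (possibly stale) answer, even after the upstream fix lands.
            return addresses
        return None  # expired: the next lookup goes back out to DNS

cache = TtlCache()
cache.put("dynamodb.us-east-1.amazonaws.com", ["198.51.100.10"], ttl_seconds=60)
print(cache.get("dynamodb.us-east-1.amazonaws.com"))  # served from cache for up to 60s
```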
Scope of the Disruption Across Apps and Platforms
The outage extended far beyond developer tools and backend systems. Status pages and user reports indicated that several other major apps and platforms, including Coinbase, Fortnite, Signal, Venmo, and Zoom, were also having trouble. Amazon’s own services such as Ring were affected, as was connected hardware like Eight Sleep’s cooling pods, which reportedly malfunctioned. Amazon.com properties and customer support saw knock-on issues as DNS resolution broke.
The breadth of the disruption underscores AWS’s centrality to the modern internet. AWS holds roughly a third of the worldwide market for cloud infrastructure, according to Synergy Research Group, and millions of organizations run production workloads on its platform. When a foundational service like DNS fails in such a large region, the blast radius can include everything from payments and messaging to gaming and smart-home products.
Why One AWS Region Outage Can Have a Global Impact
us-east-1 is not just big; it is the primary region for many customers and an anchor for multi-region storage workloads and control planes. Services such as DynamoDB frequently underpin authentication and session management, event and stream pipelines, and operational metadata. If clients cannot resolve the service endpoint, requests fail before they ever reach the database, causing cascading failures across microservices.
This is why AWS’s own Well-Architected Framework puts a heavy emphasis on multi-region design, health-based routing, and defensive client behavior. Some common patterns are:
- Separate read and write paths
- Regional isolation
- Circuit breakers to prevent retries from making matters worse (a minimal sketch follows this list)
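The circuit breaker named above can be as small as the sketch below, a minimal version assuming a failure threshold and cool-off period chosen by the operator; it is not part of any AWS SDK.

```python
import time

class CircuitBreaker:
    """Open the circuit after repeated failures so retries stop piling on."""

    def __init__(self, max_failures=5, reset_after=30.0):
        self.max_failures = max_failures
        self.reset_after = reset_after
        self.failures = 0
        self.opened_at = None

    def call(self, fn, *args, **kwargs):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_after:
                raise RuntimeError("circuit open: skipping call")
            # Cool-off elapsed: let a trial request through.
            self.opened_at = None
            self.failures = 0
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = time.monotonic()
            raise
        self.failures = 0
        return result
```

Opening the circuit during a DNS outage spares both the struggling dependency and the caller’s own thread pools and connection slots.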
But DNS sits one rung lower in the stack than much of that; when name resolution itself breaks, it creates a failure mode that even the best-designed failover strategies can struggle to overcome until resolvers update.
How This Outage Compares to Previous Internet Disruptions
Because the cause here was DNS resolution, the incident fits a broader pattern of foundational-layer failures producing outsized impact. In 2024, a flawed content update from the cybersecurity firm CrowdStrike crashed Microsoft Windows machines worldwide, disrupting airlines and hospitals. That followed a 2021 incident at DNS provider Akamai that briefly brought down the websites of FedEx and PlayStation Network. Different triggers, same lesson: the internet is only as resilient as its weakest common dependencies.
What Customers Can Do Now to Improve Resilience
AWS recommends that customers check the AWS Health Dashboard for details on service-impacting events and remediation plans.
Once triage gives way to stability, teams should revisit their architecture assumptions:
- Tune DNS TTLs for failover
- Consider alternative regional endpoints that are actually reachable
- Test client-side resolvers, exponential backoff, and connection pools under simulated DNS failure (see the retry sketch after this list)
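A minimal sketch of that retry behavior, assuming the failure of interest surfaces as socket.gaierror, the exception Python raises when name resolution fails; the attempt counts and delays are illustrative.

```python
import random
import socket
import time

def call_with_backoff(fn, attempts=4, base_delay=0.5, max_delay=8.0):
    """Retry a callable on DNS failures with capped exponential backoff and jitter."""
    for attempt in range(attempts):
        try:
            return fn()
        except socket.gaierror:
            if attempt == attempts - 1:
                raise  # out of attempts: let the caller (or a circuit breaker) decide
            delay = min(max_delay, base_delay * (2 ** attempt))
            # Full jitter keeps a fleet of clients from retrying in lockstep.
            time.sleep(random.uniform(0, delay))

# Example: wrap the raw lookup itself in the backoff policy.
addresses = call_with_backoff(
    lambda: socket.getaddrinfo("dynamodb.us-east-1.amazonaws.com", 443)
)
```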
Observability must include the resolver path, not just service health, so operators can tell a name-resolution failure from a service-side error in minutes.
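One way to instrument that distinction, sketched here with the standard library only; the probe target and the returned field names are assumptions, not an established monitoring API.

```python
import socket
import time
import urllib.error
import urllib.request

def probe(url, host):
    """Time name resolution and the HTTP request separately so dashboards
    can tell a resolver failure from a service-side error."""
    started = time.monotonic()
    try:
        socket.getaddrinfo(host, 443)
    except socket.gaierror:
        # The name never resolved: this is a DNS problem, not a service one.
        return {"stage": "dns", "ok": False, "seconds": time.monotonic() - started}
    dns_seconds = time.monotonic() - started

    try:
        with urllib.request.urlopen(url, timeout=5) as response:
            status = response.status
    except urllib.error.HTTPError as exc:
        status = exc.code  # the service answered, even if with an error code
    except urllib.error.URLError:
        return {"stage": "connect", "ok": False, "dns_seconds": dns_seconds}
    return {"stage": "service", "ok": status < 500, "dns_seconds": dns_seconds}

print(probe("https://dynamodb.us-east-1.amazonaws.com/", "dynamodb.us-east-1.amazonaws.com"))
```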
For organizations with stringent uptime objectives, exposure can be reduced further by using multi-provider DNS, isolating critical control-plane dependencies per region, and running disaster recovery exercises that explicitly inject DNS faults.
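Injecting a DNS fault does not require special tooling; a test can monkeypatch the resolver, as in this pytest-style sketch in which the application function under test, fetch_user, is hypothetical.

```python
import socket

import pytest

def fetch_user(user_id):
    """Hypothetical application call that depends on name resolution."""
    socket.getaddrinfo("dynamodb.us-east-1.amazonaws.com", 443)
    return {"id": user_id}

def test_fetch_user_fails_fast_when_dns_is_down(monkeypatch):
    def broken_resolver(*args, **kwargs):
        raise socket.gaierror("simulated DNS outage")

    # Every name lookup in this test now fails, mimicking the incident.
    monkeypatch.setattr(socket, "getaddrinfo", broken_resolver)

    # The call should surface a resolution error promptly instead of hanging.
    with pytest.raises(socket.gaierror):
        fetch_user("alice")
```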
The site reliability engineering playbook should state explicitly when to cut traffic, when to relax timeouts, and how to prefer partial functionality (such as read-only modes) until dependencies heal.
For now, AWS reports that systems are up and running again. The real test will be what the post-incident analysis reveals and what changes follow, both within AWS’s own DNS paths, including Route 53, and across customer architectures that just learned (once again) how much of the internet hinges on timely, correct name resolution.