A series of outages swept across popular apps and websites on Thursday, with some users experiencing issues for more than two hours. Amazon Web Services confirmed a problem affecting several services and said mitigations had been put in place. Early indicators point toward recovery, but some requests may still lag as systems work through backlogs.
What happened and why the AWS outage matters today
High error rates and latency within the US-EAST-1 region caused a widespread outage, according to updates posted on AWS’ Health Dashboard. AWS later identified a probable DNS resolution problem for the DynamoDB API endpoint in that region. If DNS cannot resolve a key database endpoint, requests never reach the service at all, which shows up downstream as failed logins, broken session management, and missing content across dependent apps.
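To see why a DNS failure stops requests before they ever reach the database, consider this minimal Python sketch. It simply checks whether the public DynamoDB regional hostname resolves; it is an illustration of the failure mode, not AWS’ diagnostic tooling, and the function name is ours.

```python
import socket

# The public DynamoDB API endpoint for US-EAST-1 (illustrative check only).
ENDPOINT = "dynamodb.us-east-1.amazonaws.com"

def can_resolve(hostname: str) -> bool:
    """Return True if the hostname resolves to at least one address."""
    try:
        socket.getaddrinfo(hostname, 443, proto=socket.IPPROTO_TCP)
        return True
    except socket.gaierror as exc:
        # During a DNS failure, clients never get far enough to open a
        # connection or send a request; the error surfaces right here.
        print(f"DNS resolution failed for {hostname}: {exc}")
        return False

if __name__ == "__main__":
    print("resolvable" if can_resolve(ENDPOINT) else "unresolvable")
```

When resolution fails, every caller that depends on that endpoint errors out at this first step, which is why the symptoms looked so similar across otherwise unrelated apps.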
The impact spread fast because so many companies anchor workloads in US-EAST-1, a region favored for its cost-effectiveness, capacity, and proximity to major user populations. AWS holds roughly a third of the global cloud infrastructure market, according to estimates from Synergy Research Group, and US-EAST-1 is one of its busiest regions. When something as fundamental as DNS resolution to DynamoDB goes wrong there, the blast radius can cover consumer apps, enterprise tools, and media platforms all at once.
Who was affected by the AWS outage and how users felt
Posts on social networks and outage-tracking websites suggested that problems had spread to:
- Prime Video
- Snapchat
- Asana
- Slack
- Grammarly
- PlayStation Network
- Signal
Symptoms differed by app and architecture: some users saw infinite spinners at login, others lost access to real-time features, and still others received outright 5xx errors.
Concrete examples matched the pattern of a database and DNS disruption. Editors such as Grammarly reported loading failures, collaboration suites such as Slack saw delayed or dropped messages, and streaming services ran into playback errors. Gaming networks had trouble with authentication or store access. Not every user was affected, and many customers with multi-region failover or aggressive caching were able to ride out the event by operating with reduced functionality.
Current status of AWS services and ongoing recovery
AWS says that mitigations are in place and most requests to its services are succeeding as backlogs are cleared. The recovery phase often has lingering effects: queues need draining, caches must be repopulated, and clients retrying in lockstep can create thundering-herd traffic that prolongs the period of high latency. Services are returning to normal, but intermittent slowdowns or errors can continue while the system stabilizes.
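One reason recovery drags is that clients retrying on a fixed schedule tend to synchronize and hammer a service the moment it comes back. The standard countermeasure is exponential backoff with full jitter; the sketch below is a minimal, hedged illustration, with function names, attempt counts, and delay caps chosen for the example rather than taken from any particular SDK.

```python
import random
import time

def call_with_backoff(operation, max_attempts=5, base_delay=0.5, max_delay=30.0):
    """Retry `operation` with exponential backoff and full jitter.

    Full jitter spreads retries uniformly across the backoff window so
    that thousands of clients don't all retry at the same instant.
    """
    for attempt in range(max_attempts):
        try:
            return operation()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # Out of attempts; surface the error to the caller.
            # Exponential window, capped, then a uniformly random sleep.
            window = min(max_delay, base_delay * (2 ** attempt))
            time.sleep(random.uniform(0, window))
```

Spreading retries this way trades a little extra latency per client for a much gentler load curve on the recovering service.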
Analyses of past AWS incidents suggest that dependent services may need time to “wake up” and re-synchronize state even after the root cause is mitigated. Phased recovery is a common approach: core functionality is restored first, and background jobs, analytics pipelines, or region-specific experiences are brought back online as capacity allows.
What users and engineering teams can do during outages
For end users, local troubleshooting rarely fixes cloud-side events. The practical moves are to be patient, avoid repeated logins that might lock an account, and check a site like Is It Down Right Now or Downdetector for an uptick in error reports before doing anything drastic. Beyond that, options are limited: switch to a temporary alternative or work offline until the affected platforms stabilize.
Engineering teams can cushion against similar incidents by adopting resilience patterns. That means:
- Multi-region or multi-AZ architectures for stateful systems such as a session store
- Circuit breakers so that runaway retries don’t cascade into a wider failure (see the sketch after this list)
- Exponential backoff with jitter
- Graceful degradation that keeps critical paths alive while ancillary services fail
- DNS best practices: sane but short TTLs, health-checked endpoints, and fallbacks
- Load shedding at the edge and chaos testing to reveal lurking dependencies
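As a concrete illustration of the circuit-breaker item above, here is a minimal Python sketch. The class name, thresholds, and timings are illustrative assumptions, not a production library; real deployments typically reach for a battle-tested implementation.

```python
import time

class CircuitBreaker:
    """Tiny circuit breaker: trips open after repeated failures, then
    allows a single trial call once a cool-down period has elapsed."""

    def __init__(self, failure_threshold=5, reset_timeout=30.0):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, operation):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_timeout:
                # Fail fast instead of piling load onto a struggling dependency.
                raise RuntimeError("circuit open: not calling downstream")
            # Cool-down elapsed: half-open, let one trial request through.
        try:
            result = operation()
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = time.monotonic()  # Trip (or re-trip) the breaker.
            raise
        # Success closes the circuit and resets the failure count.
        self.failures = 0
        self.opened_at = None
        return result
```

Paired with jittered backoff, a breaker like this keeps a degraded dependency from being buried under retries while still probing periodically for recovery.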
The bigger picture for cloud resilience and concentration risk
Concentration risk remains a defining feature of modern cloud computing. With just a few providers powering much of the internet, rare but high-impact events can have far-reaching effects. “Severe outages that have material business and customer impact persist, yet many of them are the result of configuration, networking or dependency failures — the very areas today’s event focuses on,” the Uptime Institute said.
This is not to say that the cloud is fragile, just that scale magnifies both success and failure modes. When a heavyweight like AWS has regional indigestion, particularly in foundational dependencies such as DNS resolution and DynamoDB, the impact is felt broadly, not just within one app. The good news is that mitigations are in place, recovery is well underway, and the industry has a growing arsenal of tools to help keep the next hiccup off the front page.