The internet’s archive is crumbling. Researchers have noticed a sharp falloff in how frequently the Wayback Machine takes snapshots of news homepages, alarming journalists, librarians, and open-web advocates, who increasingly find they must capture pages themselves if they want to record how something appeared at a given moment or prove that it was edited after publication.
A Nieman Lab analysis found that the archive preserved 1.2 million snapshots of 100 major news homepages earlier this year, but only 148,628 from a comparable recent period, an 87 percent decline.
One marquee example: CNN’s homepage dropped from 34,524 captures in the earlier window to just 1,903 in the later one.
A Sudden Drop With No Clear Explanation
The Wayback Machine, a nonprofit initiative of the Internet Archive that normally crawls some 500 million web pages per day, said that certain archiving projects had been disrupted and that some captures had been archived but not yet indexed. In other words, snapshots may have been taken without showing up in public search results, an outcome the organization described as expected given operational constraints and resourcing challenges.
Indexing backlogs are normal in large web archives, but a dip of this magnitude and duration across so many sites is uncommon. The Internet Archive has not provided enough technical detail to say which crawls were affected, which collections are impacted, or how much data is sitting in a queue awaiting indexing, leaving outside observers to infer causes from the captures they cannot find.
Why Reduced Archiving Puts the Record at Risk
News homepages are the constantly changing front doors to coverage, and they reflect editorial priorities; when they go uncaptured, that record of priorities is lost. High-frequency archiving makes it possible to verify what was published when, track stealth edits and takedowns, and examine how major news events were presented to readers on each site. As captures decline, so do those accountability and research functions.
Unlike print newspapers, which libraries systematically collected and preserved, digital news output has been largely ephemeral. A homepage with millions of visitors may refresh dozens of times a day; if it is captured only sporadically, key moments disappear. Fact-checkers, researchers, and investigative journalists rely on densely packed timelines of snapshots to reconstruct events long after websites have moved on.
Possible Factors Behind the Wayback Slowdown
Resources are the simplest explanation. Public financial statements indicate that the Internet Archive’s expenses have been running well ahead of revenue, and the resulting squeeze is pinching crawl capacity, storage, and indexing throughput. Web archiving is expensive: bandwidth, compute for parsing and deduplication, petabyte-scale storage, and the staff to run it all.
Technical headwinds may be adding to the pressure. More publishers are deploying bot-management tools, aggressive rate limiting, and dynamic JavaScript frameworks that thwart traditional crawlers; robots.txt directives can block or throttle archiving; and personalized or paywalled experiences are harder to replicate with fidelity. Any one of these factors might matter at the margins, but the scope and timing of the slowdown Nieman Lab documented point to a systemic, resourcing-level cause rather than site-by-site friction.
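For illustration, here is a hypothetical robots.txt fragment of the kind a publisher might use to throttle or block archiving crawlers; the paths are placeholders, ia_archiver is a user-agent token historically associated with Internet Archive crawling, and how strictly the Wayback Machine honors such directives has varied over time.

```
# Hypothetical example: throttle one archiving crawler and wall off a section.
User-agent: ia_archiver
Crawl-delay: 30          # nonstandard directive; honored by some crawlers only
Disallow: /subscribers/  # placeholder path

# All other crawlers are unrestricted.
User-agent: *
Disallow:
```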
The archive has also had security and reliability problems in recent memory, including a significant breach that caused extended downtime. Even after service returns, backlogs and reprioritization can ripple through pipelines, delaying lower-priority tasks such as frequent homepage snapshots.
What the Archive Says About the Reduced Captures
Mark Graham, who heads the Wayback Machine, has said that an issue with some archiving projects reduced captures for certain sites and that a portion of the “missing” snapshots will become available once indexing is complete.
He characterized the delays as operational rather than a shift in mission. Still, the organization has not released a public schedule for clearing the backlog. “Nor does it typically take that long — months are unusual for a system that usually bounces back from new captures quite rapidly,” says Jean Pagé.
The Broader Safety Net for Web Archives Is Thin
Other organizations do archive the web: the Library of Congress’s programs, national web archives in Europe, members of the International Internet Preservation Consortium (IIPC), and the datasets published by Common Crawl. Many libraries also subscribe to Archive-It, an Internet Archive service for building focused collections. But none of them matches the Wayback Machine’s frequency and scope for news homepages, which is why a prolonged slowdown matters.
What to Watch Next as Archiving Services Recover
Signs of recovery would include denser daily capture timelines returning for the big outlets and older “missing” snapshots reappearing as indexes are gradually rebuilt. More transparency, such as a public status page for crawling and indexing and clearer documentation of backlogs, would help users calibrate expectations and pinpoint gaps early.
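For anyone who wants to track that recovery directly, here is a minimal sketch, assuming the Wayback Machine’s public CDX API at web.archive.org/cdx/search/cdx; the homepage URL and seven-day window are placeholders.

```python
# Minimal sketch: count Wayback Machine captures per day for one homepage,
# using the public CDX API. The URL and window below are placeholders.
import json
from collections import Counter
from datetime import date, timedelta
from urllib.parse import urlencode
from urllib.request import urlopen

CDX = "https://web.archive.org/cdx/search/cdx"

def captures_per_day(url: str, days: int = 7) -> Counter:
    """Return a Counter mapping YYYYMMDD dates to capture counts."""
    start = (date.today() - timedelta(days=days)).strftime("%Y%m%d")
    params = urlencode({"url": url, "from": start, "output": "json", "fl": "timestamp"})
    with urlopen(f"{CDX}?{params}") as resp:
        rows = json.load(resp)
    # First row is the column header; timestamps look like YYYYMMDDhhmmss.
    return Counter(row[0][:8] for row in rows[1:])

if __name__ == "__main__":
    for day, count in sorted(captures_per_day("cnn.com").items()):
        print(day, count)
```

A sparse or shrinking per-day count for a major homepage is exactly the kind of gap the Nieman Lab analysis flagged; a return to dozens or hundreds of captures per day would signal recovery.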
In the meantime, editors and researchers can reduce their exposure by triggering captures of critical pages themselves with Save Page Now, coordinating institutional collections through Archive-It, and exporting their own change logs.
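As a starting point, here is a minimal sketch that requests a capture through the public Save Page Now endpoint at web.archive.org/save; the User-Agent string and contact address are placeholders, rate limits apply, and bulk archiving is better handled through the authenticated Save Page Now API.

```python
# Minimal sketch: ask the Wayback Machine to capture a URL via the public
# Save Page Now endpoint. The User-Agent and contact address are placeholders.
from urllib.request import Request, urlopen

def save_page_now(url: str) -> str:
    """Request a capture of `url` and return the archived URL it resolves to."""
    req = Request(
        f"https://web.archive.org/save/{url}",
        headers={"User-Agent": "newsroom-archiver/0.1 (contact: editor@example.org)"},
    )
    # A successful capture redirects to a /web/<timestamp>/<url> snapshot.
    with urlopen(req, timeout=120) as resp:
        return resp.geturl()

if __name__ == "__main__":
    print(save_page_now("https://www.cnn.com/"))
```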
The open web does need redundancy, but for day-to-day news history the Wayback Machine is the keystone that’s still standing — and its precipitous slowdown is a sign of how fragile our digital memory has become.