Spotify has confirmed it is investigating a large-scale scrape of its catalog after the pirate preservation group Anna’s Archive claimed to have copied tens of millions of songs, up to 300 TB of music, from the streaming service.
The group said it scraped metadata for 256 million tracks and claimed to have audio files for 86 million of them, a haul it says accounts for “99.6%” of listening activity. A third party “illegally accessed” Spotify systems and used stolen or leaked login credentials to gain unauthorized access to some of the streaming service’s songs and recordings, according to a statement emailed by an outside PR firm representing the company. A similar statement emailed by Spotify to Billboard described “fraudulent” methods, including scraping public metadata and bypassing digital rights management software to reach some music files.

What data and audio files were taken in the scrape
Anna’s Archive says the cache includes track names, artist and album information, and popularity details for 256 million items, as well as audio for 86 million tracks. According to the group, it will release torrents in stages, ordered by popularity (as measured by Spotify’s publicly available popularity scores), and will include album art but no additional metadata.
Bitrate is inconsistent throughout the collection: the group says top songs are encoded at 160 kbps, while less popular tracks can drop as low as 75 kbps to reduce file sizes. Some quick back-of-the-envelope math on the 300 TB figure: an average three-minute song at 160 kbps is around 3.6 MB, so 86 million files of that size would come to roughly 310 TB. Even with many files at lower bitrates, it’s easy to see the total reaching into the hundreds of terabytes.
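The arithmetic can be checked with a short sketch. It uses only the figures quoted by the group (86 million audio files, 160 kbps and 75 kbps) and makes simplifying assumptions that are ours, not the source's: every track is exactly three minutes long, bitrates are constant, and units are decimal (1 MB = 10^6 bytes). The function names are illustrative.

```python
# Back-of-the-envelope storage estimate from the figures quoted in the article.
# Assumptions (ours, not the source's): every track runs exactly 180 seconds,
# bitrate is constant, and units are decimal (1 MB = 10**6 bytes, 1 TB = 10**12 bytes).

def track_size_mb(bitrate_kbps: float, seconds: float = 180.0) -> float:
    """Size of one constant-bitrate track in decimal megabytes."""
    bits = bitrate_kbps * 1_000 * seconds
    return bits / 8 / 1_000_000


def library_size_tb(n_tracks: int, bitrate_kbps: float, seconds: float = 180.0) -> float:
    """Total size in decimal terabytes if every track used the same bitrate."""
    return n_tracks * track_size_mb(bitrate_kbps, seconds) / 1_000_000


N_AUDIO_FILES = 86_000_000  # audio files claimed by the group

# Two bounding cases: everything at the top bitrate, everything at the lowest.
upper_tb = library_size_tb(N_AUDIO_FILES, 160)
lower_tb = library_size_tb(N_AUDIO_FILES, 75)

print(f"one 3-minute track at 160 kbps: {track_size_mb(160):.1f} MB")
print(f"all files at 160 kbps: {upper_tb:.1f} TB")
print(f"all files at  75 kbps: {lower_tb:.1f} TB")
```

Under these assumptions the two bounding cases come out at about 309.6 TB and 145.1 TB. Real track lengths vary, so this is only a plausibility check on the order of magnitude, not a reconstruction of the actual collection.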

How the scrape likely worked to bypass protections
Spotify says public metadata was scraped, a common occurrence on the open web, but the key claim is DRM circumvention to extract audio from protected streams. At scale, that usually involves automated requests, IP-hopping infrastructure, session theft, or abuse of gaps in access controls. Spotify doesn’t elaborate on the vector, but its description suggests the attackers went beyond ordinary scraping to bypass content protections, a line that would trigger anti-circumvention concerns under laws such as DMCA Section 1201 in the US and analogous rules under the EU’s InfoSoc Directive.
That the scraped library seems to track listener popularity also implies that the operators optimized collection for cultural coverage rather than completeness: an artist may be missing from the data set simply because they were unpopular, not because of a collection failure. Pairing popularity-ranked selection with lower bitrates for less-played tracks is an optimization that shrinks the total transfer while maximizing the share of listening activity covered per terabyte.

Why a 300TB cache of popular music matters now
Even at lossy bitrates, 300 TB of popular music is a significant act of both preservation and piracy. For archivists and scholars, it provides a snapshot of the modern streaming canon that can be tied to real-world consumption. From the rightsholders’ perspective, it’s a huge unauthorized distribution pipeline that could wreck licensing economics if the torrents spread widely.
The scale also raises questions of competitive intelligence: popularity scores, release cadences, and catalog gaps are useful signals for labels, indie distributors, and recommendation researchers. Although Spotify’s own recommendation system was not exposed, bulk access to popularity scores and play proxies can be used to infer behavioral patterns. Historically, industry bodies such as the RIAA and IFPI have moved quickly to shut down torrent indexes offering pirate copies of major-label back catalogs, and similar action is likely here.

How artists and everyday listeners could be affected
We have no reason to believe that user accounts, personal data, or other sensitive information were breached; the incident concerns content files and publicly available metadata. Artists and labels now face a near-term risk of substitution: some consumption will move to torrents, but the lower bitrates and lack of platform features (song lyrics, playlists, social sharing) blunt that impact for many listeners.
Another concern is uncontrolled reuse. For instance, large scraped audio sets can feed unlicensed machine-learning models used for tasks such as source separation, vocal cloning, and genre synthesis. That risk has been on the minds of labels this year, as a high-profile string of lawsuits over unlicensed training on music and lyrics gained attention. A widely copied torrent would be more difficult to police, and its provenance murkier.

What Spotify and rightsholders could do next
Expect a multi-faceted response: technical mitigations (closing the scraping vectors, tightening token lifetimes and key rotation, and ramping up bot detection), litigation aimed at distribution points, and coordinated takedowns by rightsholder groups.
If Anna’s Archive goes through with staggered releases, we can expect a longer game of cat and mouse over the months ahead on torrent sites and mirrors.
The core problem for Spotify is how to reconcile an open discovery platform (public artist pages and charts, say) with protections strong enough to discourage industrial-scale extraction. That usually includes rate limiting, anomaly detection for popularity-weighted access patterns, and a variety of DRM hardening. For the music industry, however, the incident highlights a larger trend: streaming has funneled much of the world’s music onto a few platforms, and as such they’ve become high-value targets not only for pirates but also for anyone else trying to capture sweeping cultural datasets.
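To make the rate-limiting idea concrete, here is a minimal sketch of a token bucket, one common way to cap how fast a single client can make requests. Nothing here reflects how Spotify actually limits traffic: the class name, parameters, and numbers are hypothetical, chosen only for illustration.

```python
# Illustrative token-bucket rate limiter. This is a generic technique, not a
# description of Spotify's systems; all names and numbers are hypothetical.

class TokenBucket:
    """Allows short bursts up to `capacity`, then a steady `refill_per_sec` rate."""

    def __init__(self, capacity: float, refill_per_sec: float) -> None:
        self.capacity = capacity
        self.refill_per_sec = refill_per_sec
        self.tokens = capacity  # start with a full bucket
        self.last_ts = 0.0      # timestamp of the previous request, in seconds

    def allow(self, now: float, cost: float = 1.0) -> bool:
        """Return True if a request arriving at time `now` may proceed."""
        # Refill in proportion to the time elapsed, never above capacity.
        elapsed = max(0.0, now - self.last_ts)
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_per_sec)
        self.last_ts = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False


# A client that fires 10 requests at once gets only the first 5 through.
demo_bucket = TokenBucket(capacity=5, refill_per_sec=1.0)
burst = [demo_bucket.allow(now=0.0) for _ in range(10)]
print(burst.count(True), "of", len(burst), "burst requests allowed")

# Three seconds later, three tokens have refilled, so a fourth request is refused.
later = [demo_bucket.allow(now=3.0) for _ in range(4)]
print(later)
```

A limiter like this slows a single client but does little against a scraper spread across many IP addresses or accounts, which is why it is usually paired with the anomaly detection mentioned above.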
Spotify’s investigation is ongoing. The group responsible for the scrape, meanwhile, is framing the release as a preservation project. Whatever the framing, though, the disclosure exposes a fact of life in the era of streaming: when the catalog becomes the internet’s version of a commons, being able to protect — and profit from — those pipes is just as crucial as licensing those songs.
