One of the creators of RSS, the classic web syndication technology, is leading a new charge to fix the web's messy data economy, launching a protocol he hopes will let publishers license their content to AI model builders at internet scale.
Called Really Simple Licensing (RSL), the initiative pairs a machine-readable standard with a collective rights body, in the hope of making paying for training data as simple as reading a robots.txt file.

Why a licensing protocol now
The timing is not accidental. Following a historic $1.5 billion copyright settlement with Anthropic and an expanding docket of some 40 pending suits over unauthorized scraping, the AI industry is under pressure to show it can obtain data legally. One suit targets AI image generators for reproducing copyrighted characters such as Superman, a sign of how murky provenance and permission have become across text and images alike.
Meanwhile, model builders have proven willing to pay, provided the terms are transparent. News publishers, software repositories, and forum operators have already reached custom deals with leading labs. What has been lacking is a scalable system, and a shared language, for everyone else.
Inside the Really Simple Licensing process
RSL comes in two parts. First is the RSL Protocol, a technical schema that lets publishers express license terms in a standard, machine-readable snippet alongside their robots.txt. Sites can declare whether AI use is allowed, prohibited, or requires a negotiated license, and can point to Creative Commons terms or custom provisions. The aim is to eliminate guesswork and give crawlers and retrieval systems explicit permissions they can respect automatically.
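To make the idea concrete, here is a minimal sketch of how a compliant crawler might read such a declaration. The XML element names, the snippet itself, and the `usage_policy` helper are all illustrative assumptions based on the description above, not the actual RSL schema.

```python
import xml.etree.ElementTree as ET

# Hypothetical machine-readable license declaration, served alongside
# robots.txt. Element and attribute names are invented for illustration;
# the real RSL schema may differ.
RSL_SNIPPET = """
<license>
  <content path="/articles/">
    <usage type="ai-training">requires-license</usage>
    <usage type="ai-search">allowed</usage>
    <terms href="https://example.com/licensing"/>
  </content>
</license>
"""

def usage_policy(xml_text, usage_type):
    """Return the declared policy ('allowed', 'prohibited', or
    'requires-license') for a usage type, or None if undeclared."""
    root = ET.fromstring(xml_text)
    for usage in root.iter("usage"):
        if usage.get("type") == usage_type:
            return usage.text.strip()
    return None

# A compliant crawler checks the policy before ingesting content:
print(usage_policy(RSL_SNIPPET, "ai-training"))  # requires-license
print(usage_policy(RSL_SNIPPET, "ai-search"))    # allowed
```

The point of the design is that the check is mechanical: a crawler needs no legal interpretation, only a lookup, which is what makes enforcement at internet scale plausible.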
Second is a legal regime to complement the technical one. The RSL Collective is similar to ASCAP in music or MPLC in film, in that it provides a single negotiation and invoicing entity for royalties. AI companies get one door to knock on; rightsholders get uniform terms without assembling a legal team for every deal.
Who’s behind it — and why it matters
A cast of web heavyweights has pledged support. Early publisher and platform adopters include Yahoo, Reddit, Medium, O'Reilly Media, Ziff Davis, Internet Brands (owner of WebMD), People Inc., and The Daily Beast, along with platforms like Quora, Fastly, and Adweek. Notably, some backers already have lucrative deals of their own: Reddit is said to generate around $60 million a year from Google for data access, a clear indicator that buyers will pay when the value in return is apparent.
For the long tail of the web, the small independent blogs, niche forums, and specialty journals that have historically powered AI's knowledge base, RSL offers leverage they couldn't find alone. Rather than blocking bots outright or being scraped for free, smaller sites can join what amounts to a collective that sets terms and sees to it that they get paid.
The difficult part: monitoring, and paying for, use
It's easy to tell when a song is played; proving that a sentence helped train a model is not. The low-hanging fruit is real-time retrieval systems (think AI search summaries), which can attribute each citation and log usage by policy. Training large models on sprawling corpora is trickier. If ingestion is not logged at training time, there is no way to establish whether a given document was used; per-inference royalty schemes only add complexity when the same document resurfaces across multiple inferences.
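A toy sketch of what per-citation usage accounting in a retrieval product might look like. The `UsageLedger` class, the flat per-citation rate, and the publisher names are invented for illustration; real metering would be far more involved.

```python
from collections import Counter

class UsageLedger:
    """Toy per-citation usage log for a retrieval system (illustrative)."""

    def __init__(self):
        # Counts citations per (publisher, document) pair.
        self.citations = Counter()

    def record(self, publisher, doc_id):
        # Called once per cited document per generated answer.
        self.citations[(publisher, doc_id)] += 1

    def royalties(self, rate_cents_per_citation):
        # Aggregate the amount owed to each publisher under a flat
        # metered rate, in integer cents to avoid float rounding.
        owed = Counter()
        for (publisher, _doc), n in self.citations.items():
            owed[publisher] += n * rate_cents_per_citation
        return dict(owed)

ledger = UsageLedger()
ledger.record("example.com", "article-42")
ledger.record("example.com", "article-7")
ledger.record("other.org", "post-1")
print(ledger.royalties(1))  # {'example.com': 2, 'other.org': 1}
```

This is exactly the "metered fees for retrieval" half of the hybrid model: attribution is already native to the product, so each citation doubles as a billing event.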
RSL's architects say "good enough" accounting beats perfect provenance. Usage reporting is already a requirement in enterprise deals. New standards such as C2PA for content provenance, along with better training logs and model registries, can lift the floor on observability. Even a hybrid of blanket licenses for training and metered fees for retrieval-augmented generation could significantly reduce friction and disputes.
Will AI labs sign on?
There's cultural resistance to overcome. For years, frontier labs have treated the open web, via sources like Common Crawl, as free raw material, and the lines between "scraping" and "browsing assistance" are still being fought over, as recent dustups between infrastructure providers and AI startups have shown. But the market is shifting: data vendors such as Scale AI are flourishing precisely because labs are willing to pay for high-quality, rights-cleared data, and that willingness will only grow if regulators and courts start to require it.
Public comments from industry leaders, including ringing calls for clear licensing frameworks at major business forums, suggest an appetite for a uniform setup. RSL wraps a protocol layer around that sentiment and challenges labs to turn talk into adoption.
What to watch next
The first pilots will speak volumes: expect retrieval products, where attribution is native, to adopt RSL first, followed by training licenses with pragmatic reporting. Regulators in the EU, UK, and U.S. are currently wrestling with exceptions for text and data mining and norms for transparency; if a common protocol were taken up broadly, it could serve as a de facto compliance mechanism. And if enough important publishers turn RSL terms on in robots.txt, labs may learn that "open" data is no longer free for the taking.
RSS did for the web what the web couldn't do for itself, helping it articulate how content should be syndicated. RSL wants to do the same for how content should be remunerated. Whether that ambition holds will depend less on technical grace than on whether the AI ecosystem decides that paying for data is simply part of doing business.