Generative AI is on a collision course with the norms that made open source work, and the fallout could be existential.
What was once an order built on attribution, reciprocal licensing, and community contribution is being tested by models that ingest and remix codebases at scale without clear origins or obligations. If that tension is not resolved, the digital commons that underpins the entire internet may crumble just as demand for AI climbs.

Open source is not a side project — it’s the underpinning of modern computing. Synopsys’ Open Source Security and Risk Analysis (OSSRA) reports have found, year after year, that more than 90% of commercial software codebases contain open source. GitHub says it has over 100 million developers. But now generative AI engines are slurping up this commons and spitting out code that looks useful but typically can’t be properly attributed, licensed, or contributed back.
Why Provenance Is at the Heart of the Open Source Crisis
Open source depends on knowing provenance: who wrote a line of code, where it originally lived, and under what licensing terms. Large language models compress their training data into billions of parameters and spew out snippets that might mimic GPL, AGPL, or other copylefted code — without attribution. The result is a form of “license amnesia” in which origin, authorship, and obligations are erased.
That matters because reciprocity is a feature, not a bug. Copyleft licenses depend on derivative works being licensed under the same terms. When AI-generated code drops into a codebase without a chain of custody, developers can’t comply with attribution and redistribution requirements. Compliance becomes a game of reading the entrails; maintainers can’t accept patches they don’t understand or can’t validate; the contribution loop breaks.
How Software Licensing Collides With Large Language Models
Who, or what, is behind the code remains an unresolved and troubling legal question. The U.S. Copyright Office has stated that it will refuse to register a claim if it determines that a human being did not create the work. That creates a paradox: AI-generated code can be uncopyrightable, yet still embed protectable expression from its training data. Meanwhile, plaintiffs have challenged the scraping and reproduction practices underpinning AI systems, from a class action against GitHub Copilot to higher-profile cases involving text and media.
For copyleft communities, the risk is lopsided. If AI output includes material that is essentially identical to GPL code, downstream users may inherit obligations they cannot meet, or do not even know they have. On the other hand, if platforms treat AI output as “public domain by default,” they smash the very reciprocity mechanisms that have kept the commons in place for decades.
Why Economic Gravity Pushes AI Toward Closed Systems
The economics of AI are rigged to centralize power. Training frontier models requires large proprietary datasets, expensive compute, and specialized engineering — advantages concentrated in a few companies. Many “open” AI releases are in fact “open weight” or “source available,” with usage conditions that do not meet the Open Source Initiative’s definition of open. The result is enclosure: companies keep using open source as an upstream input while the downstream models, interfaces, and output formats stay behind a locked gate.

Such enclosure erodes the incentive to give back. When AI systems extract community work to build differentiated, closed products, maintainers lose visibility and leverage. Over the long term, fewer maintainers means slower patching, fewer features, and projects quietly slipping into unsupported status — an outcome that weakens everyone’s software supply chain.
Security and Compliance Risks Increase with AI Adoption
Early research has flagged disturbing trends. Studies of AI coding assistants have reported an alarming proportion of insecure suggestions; in one widely cited study, roughly 40 percent of generated code in security-relevant scenarios contained vulnerabilities. Combine that with opaque lineage and you wind up with code that is both harder to secure and, ironically, harder to license properly.
We have already glimpsed the fragility of the commons. The Log4Shell crisis laid bare just how much of the internet’s life-support system can depend on a small volunteer team. If AI accelerates the consumption of open source while suppressing upstream contributions, those maintenance bottlenecks only get worse. Vulnerabilities linger. Compliance audits become nightmares. The cost ultimately falls on businesses and public institutions.
Walking a Tightrope to Keep the Open Source Commons Alive
There are off-ramps — but they take coordination. Model and data provenance standards, the AI equivalent of an SBOM, could help track when and how licensed code influences outputs. The Open Source Initiative’s work on an Open Source AI Definition aims to clarify what “open” should mean in the era of models. The EU’s AI Act pushes documentation and transparency requirements that could mitigate provenance blind spots.
On the supply side, training models on human-curated, license-respecting datasets and building features that embed attribution metadata into suggestions would restore a measure of reciprocity. On the demand side, enterprises can demand provenance awareness from AI tools, help fund critical maintainers (through sponsorships and foundations such as the Linux Foundation and OpenSSF), and adopt policies that specifically keep “mystery code” out of production.
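To make the idea concrete, here is a minimal sketch in Python of what suggestion-level attribution metadata could look like. Everything in it is hypothetical: the field names, the SuggestionProvenance record, and the similarity threshold are illustrative assumptions, not an existing standard (SPDX, CycloneDX) or any vendor’s actual format.

```python
# Hypothetical sketch of SBOM-style attribution metadata attached to an
# AI code suggestion. Field names and thresholds are illustrative only.
import json
from dataclasses import dataclass, field, asdict


@dataclass
class SourceMatch:
    """A training-data source the suggestion closely resembles."""
    repository: str    # e.g. a public Git URL
    license_id: str    # SPDX license identifier, e.g. "GPL-3.0-only"
    similarity: float  # 0.0-1.0 score from the provider's matching step
    attribution: str   # author/notice text the license requires


@dataclass
class SuggestionProvenance:
    """Provenance record emitted alongside a generated code snippet."""
    model: str         # model name/version that produced the suggestion
    generated_at: str  # ISO-8601 timestamp
    matches: list[SourceMatch] = field(default_factory=list)

    def requires_review(self, threshold: float = 0.8) -> bool:
        """Flag suggestions that closely match copyleft-licensed sources."""
        copyleft_prefixes = ("GPL", "AGPL", "LGPL")
        return any(
            m.similarity >= threshold and m.license_id.startswith(copyleft_prefixes)
            for m in self.matches
        )

    def to_json(self) -> str:
        return json.dumps(asdict(self), indent=2)


if __name__ == "__main__":
    record = SuggestionProvenance(
        model="example-code-model-1.0",
        generated_at="2024-05-01T12:00:00Z",
        matches=[
            SourceMatch(
                repository="https://example.com/some/upstream-project",
                license_id="GPL-3.0-only",
                similarity=0.92,
                attribution="Copyright (c) Upstream Project contributors",
            )
        ],
    )
    print(record.to_json())
    print("Needs license review:", record.requires_review())
```

A policy gate like requires_review is one place an enterprise rule against “mystery code” could hook in: suggestions with high-similarity copyleft matches get routed to legal review or rejected before they ever reach a production branch.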
Without these changes, however, the trajectory is grim. Generative AI will keep extracting value from the commons while starving the cultural, social, and legal feedback loops that made the commons resilient in the first place. Open source doesn’t fail all at once — it fails as maintenance slows, compliance risks mount, and innovation moves behind closed APIs. That’s not a future the software profession can afford.
