“Wait, not like that”: Free and open access in the age of generative AI


The threat is not AI using open knowledge, but AI companies killing projects that make knowledge freely available


There are moments that can give pause to those who contribute freely and openly to Wikimedia projects, or who release their own work under free licenses. These are what I call “wait, no. Not like that” moments.

What happens when a Wikipedian discovers that their carefully researched article has been packaged into an ebook and sold on Amazon for someone else’s profit? Wait, no. Not like that.

What happens when an open-source software developer watches a multi-billion-dollar tech company rely on their work without giving anything back? Wait, no. Not like that.

What happens when a nature photographer finds out that their freely licensed wildlife photograph was minted into an NFT collection on an environmentally destructive blockchain? Wait, no. Not like that.

Or, most recently: what happens when someone who publishes their work under a free license discovers that tech giants have used it to train extractive, exploitative large language models? Wait, no. Not like that.

The reactions are understandable. We freely license our work in service of a goal: open and free access to knowledge and education. When trillion-dollar companies exploit that openness while giving nothing back, or when our work is put to harmful or exploitative use, it can feel like we were naive. The natural response is to try to regain control.

Many creators find themselves in this position today, particularly as a result of AI training. But the remedies they are reaching for (more restrictive licenses, paywalls, or not publishing at all) risk destroying the very thing they set out to create.

citation needed is an independent publication, supported entirely by readers like you. Consider signing up for an annual subscription to help me continue this work.

Often, the first impulse is to tighten licensing, perhaps by switching to Creative Commons’ non-commercial (and therefore non-free) licenses. Artists turned to Creative Commons when NFTs grew popular in the early 2020s, hoping the organization would declare NFTs fundamentally incompatible with free licensing (it didn’t). It happened again when generative AI companies began training models on CC-licensed work: some were disappointed when the organization took the position that CC licenses do not categorically prohibit AI training, and that AI training should not be considered copyright infringement by default.

Creators who want that degree of control would be better served by a traditional all-rights-reserved model, in which any would-be reuser must negotiate terms with them individually. But that abandons the idea of free licensing altogether, permitting reuse only by those with the time, money, and bargaining power to negotiate. And it may not even work: we know that major AI firms have trained their models on all-rights-reserved works in their ongoing effort to ingest as much data as possible, and whether US courts will treat that training as fair use remains contested.

Some artists have decided it isn’t worth maintaining an online portfolio at all if doing so makes their work easy fodder for AI training. Many have put up content gates that restrict access (paywalls, registration walls, “are you human?” walls, and so on) to try to fend off scrapers. But this too closes off the commons, making it more difficult or more expensive for the “every single person” of the open-access manifestos to reach material that was intended as a common good.

In trying to exclude the bad actors, many people end up excluding the very people they meant to grant access to. Those who put their work behind paywalls probably didn’t intend to create works only the wealthy could access. Those who erected registration barriers probably didn’t mean for their work to be available only to people willing to risk a deluge of email spam after handing over their personal information. Those who deployed CAPTCHAs to stave off robots didn’t intend to limit their content to readers patient enough to solve the increasingly convoluted “are you human?” riddles.

If we want an equitable world where everyone can benefit from the sum of knowledge, education, culture, science, and technology, these are not the walls we should be building. Does it really matter whether a child who learns that carbon dioxide traps heat in Earth’s atmosphere, or how to calculate compound interest, thanks to the work of a Wikipedia editor encounters it via ChatGPT, via Siri, or by opening a browser to Wikipedia.org?

Rather than worrying “wait, not like that”, I think we should reframe the discussion as “wait, not only like that” or “wait, not in ways that threaten open access itself”. The threat posed by AI models trained on open-access material is not that people might access knowledge through new modalities. The threat is that these models could strangle Wikipedia and the other free knowledge repositories they feed on: benefiting from the money, labor, and care that sustain those projects while starving them of the resources they need to survive. The threat is trillion-dollar companies becoming the sole arbiters of access to knowledge after absorbing the painstaking work of those who strove to make it freely available to all, killing those projects in the process.

Irresponsible AI firms are already imposing massive loads on Wikimedia’s infrastructure. This is costly both in bandwidth and in the time of the engineers who must maintain and improve systems to handle the enormous automated traffic. And when AI companies serve up Wikipedia’s material without attribution or any other pointer back to the source, they prevent users from learning where it came from, and from visiting Wikipedia, where they might sign up as contributors or donate upon seeing a request for support. This describes most AI companies. Many AI “visionaries”, meanwhile, seem perfectly happy to proclaim that artificial superintelligence is just around the corner while insisting that attribution is a problem too hard to solve.

While I use Wikipedia as an illustration, the same scraping happens to any website hosting freely licensed material, to the benefit of AI companies and at the expense of the hosts. This isn’t about a single project. It’s about the systematic destruction of the infrastructure that makes open knowledge possible.

Anyone at an AI company who stops to think for half a second should recognize that these companies have a vampiric relationship with the commons. They depend on these repositories to survive, yet their antagonistic and disrespectful treatment of creators erodes anyone’s incentive to publish work openly in the future (freely licensed or not). They drain resources from the maintainers of these repositories without compensating them. They diminish the visibility of the original sources, leaving people unaware that they could, or should, help sustain such valuable projects. AI companies ought to want a thriving open-access ecosystem: it is what would let them keep expanding and updating their models, rather than serving answers from a snapshot of Wikipedia frozen at the moment they scraped it. Even AI firms indifferent to the public good shouldn’t find it hard to see that by bleeding these projects dry, they are destroying their own food supply.

Yet many AI companies seem to pay little attention to this, more preoccupied with the months ahead of them than with operating on any long-term timescale. Perhaps that shouldn’t surprise us: these companies rarely behave as though they expect their businesses to need to be sustainable for years to come.

These companies would be wise to prioritize the health of the commons, lest they strangle their golden goose. And we would be wise not to count on AI companies suddenly and miraculously developing a conscience or coming to their senses.

Instead, we must put mechanisms in place that compel AI firms to engage with these repositories on their creators’ terms.

There are ways to do this. Models like Wikimedia Enterprise allow AI companies to access Wikimedia data, but through paid, high-volume pipes, ensuring both that they don’t clog the system for everyone else and that they pay for the extra load they create. Creative Commons is experimenting with the idea of “preference signals”: a non-copyright-based way to communicate to AI firms and other entities the conditions under which they may or may not reuse CC-licensed works.
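Creative Commons hasn’t finalized what these signals will look like, but a crude precursor already exists in the robots.txt convention, which lets a site tell specific crawlers to stay away. A minimal sketch (GPTBot, CCBot, and Google-Extended are user-agent tokens the respective companies have published; honoring them is entirely voluntary, and a real preference signal would need to express far richer conditions than allow/deny):

    # robots.txt at the site root: a voluntary, coarse-grained signal to crawlers

    User-agent: GPTBot            # OpenAI's training-data crawler
    Disallow: /

    User-agent: CCBot             # Common Crawl, whose corpus feeds many models
    Disallow: /

    User-agent: Google-Extended   # opt-out token for Google AI training
    Disallow: /

    User-agent: *                 # everyone else, including search indexing
    Allow: /

The gap between this and what’s needed is exactly the point: allow/deny says nothing about attribution, compensation, or reinvestment, which is why a dedicated preference-signal vocabulary, and a legal framework behind it, matters.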

Some may argue that AI companies already ignoring copyright by training on all-rights-reserved works will simply ignore these mechanisms too. But there is a crucial difference: we can build clear legal frameworks for consent and compensation on existing labor and contract law, rather than relying on murky copyright claims or threatening to expand copyright in ways that would ultimately harm creators. Collective bargaining could establish enforceable agreements among AI companies, people who freely license their works, and the communities that maintain open knowledge repositories: agreements covering not just financial compensation for infrastructure costs, but also requirements around attribution, ethical use, and reinvestment in the commons.

The future of free and open access will not be about saying “wait, not like that”, but “yes, like that, on fair terms”. With fair compensation for infrastructure costs. With attribution, and with ways to help newcomers discover the commons and give back to it. With respect for the communities who make the commons, and the tools built on them, possible. Only then can we build a world in which everyone can freely share in the sum of all knowledge.


While writing this article, I learned that a SXSW panel featuring speakers from the Wikimedia Foundation and Creative Commons, titled “Openness under Pressure: Navigating the Future of Open Access”, discussed some of these same topics. I was scheduled to speak at the exact same time, so I couldn’t attend in person, but the audio recording is available online if you’re interested in the topic!
