Wikipedia wants to discourage artificial intelligence developers from scraping its platform by releasing a dataset that is optimized for training AI models. The Wikimedia Foundation announced on Wednesday that it has partnered with Kaggle, a Google-owned platform that hosts machine learning datasets. Wikimedia said the dataset hosted on Kaggle was “designed with machine-learning workflows in mind,” allowing AI developers to easily access machine-readable article content for modeling, fine-tuning, benchmarking, and alignment. The dataset, which is openly licensed as of April 15th, includes research summaries, short descriptions, infobox data, and article sections, minus references and non-written elements such as audio files.
According to Wikimedia, the “well-structured JSON” representations of Wikipedia content available on Kaggle should be an attractive alternative to “scraping and parsing raw article text,” a practice that is currently straining Wikipedia’s servers as automated AI bots consume the platform’s bandwidth. Wikimedia has already signed content-sharing agreements with Google and the Internet Archive, but the Kaggle partnership should make that data more accessible to smaller companies and independent data analysts.
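To give a sense of why structured JSON is easier to work with than scraped pages, here is a minimal Python sketch. The record layout and field names below are illustrative assumptions, not the dataset’s confirmed schema; Wikimedia has only said the release provides well-structured JSON per article, with elements like short descriptions, infobox data, and article sections.

```python
import json

# Hypothetical example record: field names are assumptions for
# illustration, not the actual schema of the Kaggle dataset.
sample_record = json.loads("""
{
  "name": "Kaggle",
  "abstract": "Kaggle is a Google-owned platform that hosts machine learning datasets.",
  "description": "Machine learning and data science platform",
  "infobox": {"type": "company", "owner": "Google"},
  "sections": [
    {"name": "History", "text": "..."},
    {"name": "Competitions", "text": "..."}
  ]
}
""")

# With pre-parsed JSON, a developer reads fields directly instead of
# scraping HTML and stripping markup, templates, and references.
print(sample_record["description"])
for section in sample_record["sections"]:
    print(section["name"])
```

The appeal for model builders is that each article arrives already segmented, so a pipeline can select only the pieces it needs (say, abstracts for pretraining or infobox pairs for benchmarking) without making any requests to Wikipedia’s servers.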
Kaggle partnerships lead Brenda Flynn said that the machine learning community relies on Kaggle for tools and testing, and that the company is thrilled to host the Wikimedia data. “Kaggle’s excited to play a part in keeping this data available, accessible, and useful,” Flynn said.