Editor's take: AI crawlers have recently become a scourge for websites that host written content or media. OpenAI and other tech giants are constantly hunting for fresh content to feed their AI models.
Wikimedia, the nonprofit organization that hosts Wikipedia and other popular websites, has raised concerns over AI scraper bots and their impact on the foundation's bandwidth. Since the beginning of 2024, demand for content hosted on Wikimedia's servers has grown significantly, with AI companies generating a large share of that traffic to train and test their products.
Wikimedia projects include some of the largest collections of freely accessible media and knowledge on the internet, used by millions of people around the world. Wikimedia Commons alone hosts 144 million images, videos, and other files shared under public domain or other open licenses, and it has suffered particularly from AI bot crawling.
Since January 2024, the Wikimedia Foundation's bandwidth usage for multimedia downloads has increased by 50 percent, with the traffic coming primarily from bots. The foundation says automated programs are scraping the Wikimedia Commons catalog to feed content to AI models, and its infrastructure simply wasn't built to withstand this kind of parasitic traffic.
When network traffic doubled, some of Wikipedia's connections to the internet became congested for about an hour. Wikimedia's Site Reliability team was able to reroute traffic and restore access, but the hiccup should never have occurred in the first place.
Wikimedia discovered that 65 percent of its most resource-intensive traffic was generated by bots. Unlike human readers, who tend to request popular pages already stored in the caching layer, crawlers sweep through obscure, rarely viewed content, so their requests slip past the cache infrastructure and hit Wikimedia's core data center directly.
Wikimedia is working to address a new network challenge that now affects the entire internet, as AI and tech firms aggressively scrape every piece of human-made content they can find. "Delivering trustworthy content also means supporting a 'knowledge as a service' model, where we acknowledge that the whole internet draws on Wikimedia content," the organization stated.
Wikimedia is promoting closer coordination with AI developers as a way forward. Dedicated APIs for bulk access would ease the bandwidth burden and make it easier to identify and deal with "bad actors" within the AI industry.