Revitalizing Endangered Languages on Wikipedia: Challenges and Opportunities
Greenlandic Wikipedia: A Cautionary Tale
When Kenneth Wehr assumed responsibility for the Greenlandic-language Wikipedia four years ago, his initial step was drastic: he erased nearly all existing content. His rationale was clear: only by starting fresh could the platform hope to survive. Although Wehr, 26, is German by birth, his fascination with Greenland, an autonomous Danish territory, began during a teenage visit. Over the years, he authored numerous obscure articles about Greenland in German and eventually relocated to Copenhagen to study Greenlandic, a language spoken by approximately 57,000 Indigenous Inuit people dispersed across remote Arctic settlements.
Launched around 2003, the Greenlandic Wikipedia had accumulated roughly 1,500 articles contributed by hundreds of editors by the time Wehr took charge. This seemed to exemplify Wikipedia’s crowdsourcing success, even in less commonly spoken languages. However, Wehr soon discovered a troubling reality: almost all articles had been created by non-native speakers, many of whom relied heavily on machine translation tools. The resulting content was riddled with errors, ranging from basic grammar mistakes to factual inaccuracies, such as an article erroneously stating Canada’s population as 41. Some entries contained nonsensical strings of letters, reflecting how poorly AI translators handle Greenlandic’s complex linguistic structure.
The Broader Impact of AI-Generated Content on Minority Language Wikipedias
Greenlandic’s predicament is far from isolated. Wikipedia hosts editions in over 340 languages, with hundreds more in development. Many smaller language editions have been inundated with AI-generated, machine-translated content. For example, volunteers working on four African language Wikipedias estimate that 40% to 60% of their articles are unedited machine translations. Similarly, audits of the Wikipedia edition in Inuktitut, a language closely related to Greenlandic and spoken in Canada, reveal that over two-thirds of multi-sentence pages contain AI-generated segments.
This influx of low-quality AI content creates a dangerous feedback loop. AI translation systems like Google Translate and ChatGPT learn from vast amounts of online text, often using Wikipedia as a primary data source for under-resourced languages. When Wikipedia pages contain errors, these mistakes become embedded in AI training data, leading to increasingly flawed translations. This cycle threatens the integrity of minority languages online, as poor-quality content proliferates and misrepresents linguistic nuances.
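The loop described above can be made concrete with a toy calculation. The sketch below is purely illustrative and is not drawn from any study cited here: the function name, the parameters (`mt_share`, `model_noise`, `fix_rate`), and every number are assumptions chosen only to show the direction of the effect. It assumes a translator trained on a wiki reproduces that wiki's error rate plus some noise of its own, and that each round a share of the wiki is replaced by that translator's output.

```python
def simulate_corpus_error(initial_error, mt_share, model_noise, fix_rate, rounds):
    """Track the share of erroneous sentences in a hypothetical wiki over time.

    All parameters are illustrative assumptions, not measured values.
    """
    error = initial_error
    history = [error]
    for _ in range(rounds):
        # A translator trained on the wiki inherits its errors and adds its own.
        model_error = min(1.0, error + model_noise)
        # A share of the wiki is replaced by machine output; in the remainder,
        # human editors correct a fraction of the existing errors.
        error = mt_share * model_error + (1 - mt_share) * error * (1 - fix_rate)
        history.append(error)
    return history


# Hypothetical scenario 1: half the wiki is re-filled by machine output each
# round and nobody corrects anything.
unchecked = simulate_corpus_error(0.10, mt_share=0.5, model_noise=0.10,
                                  fix_rate=0.0, rounds=5)

# Hypothetical scenario 2: no machine output, and editors fix half of the
# remaining errors each round.
curated = simulate_corpus_error(0.10, mt_share=0.0, model_noise=0.10,
                                fix_rate=0.5, rounds=5)

print([round(e, 3) for e in unchecked])
print([round(e, 3) for e in curated])
```

Under these assumed settings the error share climbs every round in the unchecked scenario and shrinks in the curated one. Real dynamics are far messier, but the sketch shows why the balance between machine-generated input and human correction matters more than the starting quality of the wiki.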
Understanding the Data Dilemma in AI Language Models
Kevin Scannell, a computer scientist specializing in endangered languages, explains that AI models rely solely on raw textual data without access to grammar guides or dictionaries. This limitation is especially problematic for languages with limited digital presence. In 2020, research showed that Wikipedia constituted over half of the training data for AI models translating several African languages, including Malagasy, Yoruba, and Shona. A 2022 German study further identified Wikipedia as the only readily accessible online linguistic resource for 27 under-resourced languages.
The consequences are profound: poorly written Wikipedia articles can accelerate language decline by disseminating inaccurate information and discouraging native speakers from engaging with digital content in their mother tongue.
Balancing Automation and Quality: The Role of AI in Wikipedia
Wikipedia has long utilized automation for routine maintenance tasks such as fixing broken links and correcting spelling errors. Bots also generate short, formulaic articles on topics like rivers or animals, generally enhancing the platform’s breadth. However, AI-driven content creation presents unique challenges. Unlike traditional bots, AI tools can produce large volumes of content rapidly, but often with questionable accuracy, especially in minority languages.
Wikipedia’s community-driven model has so far shielded it from the widespread disinformation seen on social media platforms. Yet smaller language editions suffer from a lack of active contributors, making them vulnerable to “Wikipedia hijackers”: users who flood pages with machine-translated content without sufficient linguistic knowledge. This phenomenon is exacerbated by the availability of tools like Google Translate, which, while improving, still struggle with languages that have complex grammatical structures, such as Greenlandic’s agglutinative morphology.
Case Studies: The Struggle of African and Indigenous Languages
In Nigeria, Abdulkadir Abdulkadir dedicates three hours daily to editing the Fulfulde Wikipedia, a language spoken by pastoralists and farmers across the Sahel. He emphasizes the critical need for accurate, manually translated content, warning that machine-translated articles can mislead readers and cause harm. Despite his efforts, he estimates that 60% of Fulfulde Wikipedia articles remain uncorrected machine translations, casting a bleak outlook for the platform’s future.
Similarly, Lucy Iwuala, an Igbo language contributor, highlights the damage caused by automated translations that leave untranslated English terms and nonsensical characters in articles. She stresses that such poor-quality content discourages users and undermines efforts to preserve Igbo online.
In Hawaii, assistant professor Noah Ha’alilio Solomon reports that up to 35% of words on some Hawaiian Wikipedia pages are unintelligible, threatening the language’s revitalization efforts. Hawaiian, once nearly extinct, has seen a resurgence through community activism and education, but inaccurate online content risks undoing decades of progress.
Unintended Consequences: AI-Generated Learning Materials
Beyond Wikipedia, AI-generated language learning resources have emerged, often containing significant errors. Linguist Richard Compton reviewed an AI-produced Inuktitut phrasebook sold on Amazon and found it to be “complete nonsense.” Such materials can mislead learners and hinder language preservation, especially in communities striving to reclaim their linguistic heritage.
Community-Driven Success: The Inari Saami Model
Contrasting these challenges, the Inari Saami Wikipedia exemplifies how dedicated communities can harness Wikipedia for language preservation. Once on the brink of extinction with only four child speakers, Inari Saami has rebounded to several hundred speakers, supported by schools and a thriving Wikipedia edition with over 6,400 meticulously edited articles. The Inari Saami Language Association prioritizes quality over quantity, integrating Wikipedia into educational curricula and using it as a living repository for new vocabulary, especially for modern concepts like sports and technology.
Fabrizio Brecciaroli of the association notes that while AI tools like ChatGPT struggle with Inari Saami, consistent input of high-quality content can eventually improve AI outputs, offering hope for breaking the “garbage in, garbage out” cycle.
Looking Ahead: The Future of Minority Languages on Wikipedia
Despite isolated successes, many languages face an uphill battle. Kenneth Wehr’s efforts to revive the Greenlandic Wikipedia have met with little community engagement, leading to a decision to close the edition and move the remaining content to the Wikimedia Incubator. This reflects a broader dilemma: without active native speaker involvement, small-language Wikipedias risk becoming repositories of inaccurate, AI-generated content that further endangers linguistic heritage.
UNESCO estimates that a language dies every two weeks, underscoring the urgency of preserving linguistic diversity. The Wikimedia Foundation maintains that responsibility for content quality lies primarily with language communities themselves, providing the platform as a space for growth and development. However, when communities are inactive or absent, the risk of digital language extinction intensifies.
Conclusion: A Call for Responsible AI Use and Community Engagement
The intersection of AI and minority language preservation presents both opportunities and risks. While AI can accelerate content creation, unchecked reliance on machine translation threatens to degrade linguistic resources and alienate native speakers. Success stories like Inari Saami demonstrate that with committed communities and careful stewardship, Wikipedia can be a powerful tool for language revitalization.
To safeguard endangered languages online, it is imperative to promote responsible AI use, encourage native speaker participation, and prioritize quality over quantity in digital content. Only through such concerted efforts can the rich tapestry of the world’s linguistic heritage be preserved for future generations.
