All Lab Expands Global Language Coverage on Common Crawl: 62 New Languages Added

The internet is vast, but not all languages are equally visible online. Many communities speak languages that have limited digital footprints, and as a result, these languages are often underrepresented in the large-scale web archives and datasets used for research and AI development.

As part of our ongoing commitment to linguistic diversity and digital inclusion, All Lab contributed a new set of URLs to the Common Crawl web languages repository.
This effort reflects our belief that every language deserves a place in the global digital record.

What is Common Crawl?

Common Crawl is a nonprofit organization that maintains a free and open archive of web crawl data. The archive contains petabytes of information gathered over more than 15 years.

The “web languages” repository on GitHub is a community-driven project that collects URLs for underrepresented and low-resource languages so that Common Crawl can index them.

By contributing links to content in lesser-known languages, such as community blogs, cultural sites, and regional portals, volunteers help ensure that the archived web reflects true linguistic diversity rather than being dominated by major languages.

Expanding Digital Access for Underrepresented Languages

All Lab recently contributed a curated set of URLs to the Common Crawl web languages project.

Before our involvement, the repository contained 4,452 URLs across 131 languages.
After our contribution, the repository contains 5,535 URLs across 193 languages.

This means:
1,083 new URLs added
62 newly represented languages

Our submissions included cultural websites, local news pages, regional community content, and language-specific portals that were previously missing from the dataset.

Methodology

Members of the African Languages Lab spent several hours searching the web for URLs containing content in 71 African languages. After compiling the initial list, each contributor cross-checked another member’s work for accuracy and completeness. The combined dataset was then deduplicated, sorted into the correct language categories, and cleaned by removing invalid or inaccessible URLs. This process helped ensure that the final list we submitted was consistent, reliable, and representative of the languages involved.
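For readers curious what the deduplication and cleaning steps might look like in practice, here is a minimal Python sketch. It is an illustration under stated assumptions, not the script we used: the function names `normalize` and `clean`, the language codes, and the example URLs are all hypothetical, and the check for inaccessible pages is left out because it needs live network requests.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Optional, Tuple
from urllib.parse import urlsplit, urlunsplit


def normalize(url: str) -> Optional[str]:
    """Return a canonical form of the URL, or None if it is not a valid http(s) URL."""
    try:
        parts = urlsplit(url.strip())
    except ValueError:
        # urlsplit rejects some malformed inputs, e.g. a broken IPv6 host.
        return None
    scheme = parts.scheme.lower()
    if scheme not in ("http", "https") or not parts.netloc:
        return None
    # Treat "/news" and "/news/" as the same page (an assumption of this sketch).
    path = parts.path.rstrip("/") or "/"
    # Lowercase the scheme and host, drop the fragment, keep the query string.
    return urlunsplit((scheme, parts.netloc.lower(), path, parts.query, ""))


def clean(entries: Iterable[Tuple[str, str]]) -> Dict[str, List[str]]:
    """Drop invalid and duplicate URLs, then group the rest by language code."""
    seen = set()
    by_language = defaultdict(list)
    for language, url in entries:
        canonical = normalize(url)
        if canonical is None or canonical in seen:
            continue
        seen.add(canonical)
        by_language[language].append(canonical)
    return dict(by_language)


# Hypothetical input: (language code, URL) pairs gathered by contributors.
sample = [
    ("yor", "https://Example.org/news/"),
    ("yor", "https://example.org/news"),  # duplicate after normalization
    ("hau", "http://example.com/hausa?page=1"),
    ("hau", "not a url"),  # invalid, dropped
]
cleaned = clean(sample)
```

In this sketch a URL counts as a duplicate once its scheme and host are lowercased, its fragment is removed, and any trailing slash is trimmed. A real pipeline would likely add an accessibility check on top, and the human cross-checking described above remains the step that confirms each page is actually written in the language it is filed under.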

Why This Contribution Matters

Our contribution strengthens the digital presence of underrepresented languages and supports a more inclusive internet. By expanding Common Crawl’s dataset, we help researchers, AI systems, and communities access richer, more diverse linguistic resources that preserve culture and improve global language technology.

Greater Visibility for Low-Resource Languages

Better Multilingual AI and Research

Stronger Digital Cultural Preservation

At All Lab, we believe every language carries value, holding culture, identity, and shared history. By contributing to Common Crawl’s web languages repository, we help ensure all communities are represented in the global digital record. If you care about linguistic diversity and inclusive AI, join us in curating meaningful URLs and strengthening the presence of underrepresented languages online. Together, we can build a more inclusive internet.


17 December 2025
