Unlocking AI for Africa: Massive Dataset for African Languages Revealed (2025)

Imagine a world where artificial intelligence cannot speak your language, leaving billions of people out of the digital conversation entirely. That is the reality facing speakers of African languages today: AI tools from giants like ChatGPT, DeepSeek, Siri, and Google Assistant are predominantly built and trained in languages from the Global North, such as English, Chinese, or various European languages. Is this simply an oversight, or does it reflect deeper inequities baked into the tech world? An ambitious new initiative is fighting to change the picture.

In stark contrast, African languages are severely underrepresented on the internet and in digital spaces. A dedicated group of African computer scientists, linguists, language experts, and other professionals has banded together to rectify this imbalance by training AI systems in African languages. The African Next Voices project, backed primarily by the Gates Foundation (with additional support from Meta) and collaborating with a vast network of African universities and organizations, has just unveiled what experts believe is the most extensive dataset of African languages for AI purposes to date.

The Conversation team reached out to the project's team members based in Kenya, Nigeria, and South Africa for insights into their groundbreaking work.

Why does language play such a pivotal role in AI development? Language is our primary way of connecting with others, seeking assistance, and preserving shared meanings within our communities. We rely on it to structure intricate ideas and exchange thoughts. It's also the bridge we use to communicate our needs to AI systems, and to evaluate whether they've truly grasped what we meant. As AI applications surge across fields like education, healthcare, and agriculture, these systems are trained on vast amounts of linguistic data, producing what are known as large language models, or LLMs. Unfortunately, these models currently exist for only a handful of the world's languages.

Languages aren't just about words; they encapsulate culture, values, and indigenous knowledge. Without AI that understands our languages, it struggles to interpret our intentions accurately, eroding trust and making it hard to verify responses. In essence, language is the lifeline for AI-human interaction: if it's missing, AI can't engage with us meaningfully, and vice versa. Developing AI in local languages is crucial for making it truly effective for everyone. There is a broader cost as well: by focusing AI on a limited set of languages, we overlook the vast majority of human cultures, histories, and knowledge bases, impoverishing global innovation.

Why are African languages so underrepresented, and what are the repercussions for AI? The evolution of languages is deeply linked to human histories. Many communities affected by colonialism and imperial rule saw their native tongues sidelined and underdeveloped compared to colonial languages. As a result, African languages are rarely documented, especially online, leading to a shortage of high-quality digitized text and audio for training and testing robust AI models. This scarcity stems from long-standing policies that favored colonial languages in education, media, and governance.

But it's not just about data availability—basic tools are also lacking. Do we have comprehensive dictionaries, specialized terminologies, or glossaries for these languages? The answer is often no, and challenges like incompatible keyboards, missing fonts, absent spellcheckers, and tokenizers (tools that split text into manageable parts for AI to process) further complicate matters. Add in orthographic variations (regional spelling differences), tone markings, and the incredible diversity of dialects, and the costs of building usable datasets skyrocket. The outcome? AI that underperforms or even poses safety risks, with frequent mistranslations, inaccurate transcriptions, and a basic failure to comprehend African languages.
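The fragmentation problem with tokenizers can be made concrete. The sketch below is a toy greedy longest-match tokenizer with an invented, English-heavy vocabulary (real systems use learned subword schemes like BPE, and the words shown are illustrative only): an English word the vocabulary knows becomes a single token, while a word from an unrepresented language shatters into individual characters, which makes the model's job much harder.

```python
# Toy greedy longest-match tokenizer. The vocabulary is hypothetical and
# English-heavy, to illustrate how words from languages absent from the
# training data get shattered into many small pieces.

def tokenize(text: str, vocab: set[str]) -> list[str]:
    """Greedily match the longest vocabulary entry; fall back to single chars."""
    tokens = []
    i = 0
    while i < len(text):
        for j in range(len(text), i, -1):  # try the longest substring first
            if text[i:j] in vocab:
                tokens.append(text[i:j])
                i = j
                break
        else:
            tokens.append(text[i])  # unknown character: emit it on its own
            i += 1
    return tokens

# Invented English-centric vocabulary with no entries for other languages.
vocab = {"language", "model", "the", "ing", "er", "an", "on"}

print(tokenize("language", vocab))  # one token for eight characters
print(tokenize("èdè", vocab))       # "language" in Yoruba: one token per character
```

The asymmetry in token counts is exactly why models trained on such vocabularies handle underrepresented languages poorly and expensively.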

For a real-world example, consider how this plays out in South Africa, where languages like isiZulu and isiXhosa are getting an AI makeover to improve accessibility, as highlighted in a recent News24 article (https://www.news24.com/citypress/news/isizulu-isixhosa-and-afrikaans-get-ai-makeover-20251029-0626). In practical terms, this exclusion robs countless Africans of access to global news, educational resources, health information, and the efficiency boosts AI can provide—all in their mother tongues. When a language isn't included in the data, its speakers are essentially erased from the AI products, rendering them unsafe, impractical, or biased. This widens the digital divide, sidelining millions and hindering essential services like healthcare delivery.

What steps is your project taking to address this gap—and how? Our core goal is to gather speech data for automatic speech recognition, or ASR for short. To clarify for beginners, ASR is a vital technology for oral languages, transforming spoken words into written text—think of it as the engine behind voice-to-text features in apps. Our broader vision is to investigate how ASR data is collected and determine the volume required to develop effective tools, while sharing our findings across various regions.

The data we're compiling is intentionally varied: It includes spontaneous conversations and scripted readings, covering domains like daily chats, health, finance, and farming. We source it from people of all ages, genders, and education levels to ensure representation. Every recording is made with explicit informed consent, fair pay for participants, and clear terms regarding data rights. We then transcribe everything using language-tailored guidelines and conduct extensive technical validations.
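One standard technical validation for transcribed speech is word error rate (WER): the number of word substitutions, deletions, and insertions needed to turn a hypothesis transcript into the reference, divided by the reference length. A minimal sketch, using the classic edit-distance dynamic program (the isiZulu example sentence is invented for illustration):

```python
# Word error rate (WER): edit distance over words divided by reference length.

def wer(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = edit distance between the first i ref words and first j hyp words
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution or match
    return d[len(ref)][len(hyp)] / max(len(ref), 1)

# Hypothetical check: one of four words mistranscribed gives a WER of 0.25.
print(wer("sawubona unjani namhlanje wena", "sawubona unjani namuhla wena"))
```

Tracking WER per language and per domain is what makes it possible to say how much data is "enough" for a usable ASR tool.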

In Kenya, via the Maseno Centre for Applied AI, we're collecting voice samples for five languages, spanning major groups like Nilotic (Dholuo, Maasai, and Kalenjin), Cushitic (Somali), and Bantu (Kikuyu). Through Data Science Nigeria, we're gathering speech in five widely used languages: Hausa, Igbo, Nigerian Pidgin, Yoruba, and (with partner RobotsMali collecting in Mali) Bambara, to mirror genuine community usage. Meanwhile, in South Africa, collaborating with the Data Science for Social Impact lab, we've recorded seven local languages to honor the nation's linguistic richness: isiZulu, isiXhosa, Sesotho, Sepedi, Setswana, isiNdebele, and Tshivenda.

Crucially, this effort builds on the work of pioneers like the Masakhane Research Foundation network, Lelapa AI, Mozilla Common Voice, EqualyzAI, and countless other groups and individuals who've been trailblazing African language AI, data, and tools. Each initiative reinforces the others, fostering a thriving ecosystem that brings African languages into the AI era.

How can these efforts be practically applied? The datasets and models will enable features like local-language captioning for media, voice assistants tailored to agriculture and healthcare, and multilingual support in call centers. They'll also serve as archives for cultural preservation, bridging text and speech resources through larger, balanced, public datasets.

These models won't just be lab experiments—they'll power real-world tools like chatbots, educational apps, and community services. The potential extends beyond data to entire tool ecosystems, including spellcheckers, dictionaries, translation systems, and summary generators, ensuring African languages thrive in digital environments. Ultimately, we're combining ethically sourced, top-notch speech data at scale with advanced models, empowering people to interact naturally with AI in the languages they use every day.

What's on the horizon for the project? This initiative has focused on voice data for specific languages, but what about the rest? And what of complementary technologies like machine translation or grammar checking? We'll keep expanding to more languages, crafting data and models that authentically capture how Africans speak. We're emphasizing smaller, energy-efficient models suited to local contexts for better accuracy.

The big challenge ahead is integration: weaving these elements together so African languages appear in everyday platforms, not isolated demos. A key takeaway from this and similar projects is that data collection is merely the starting point—the real impact comes from benchmarking data for reusability and connecting it to active communities.

For us, the future involves linking our ASR benchmarks to other African initiatives. Sustainability is paramount too: Providing ongoing access to computing resources, educational materials, and licensing agreements (such as the Nwulite Obodo Open Data License or Esethu Framework) for students, researchers, and innovators.

In the long run, our goal is to democratize choice—imagine a farmer, teacher, or entrepreneur using AI effortlessly in isiZulu, Hausa, or Kikuyu, rather than being forced into English or French.

Some closing questions are worth sitting with. Are initiatives like this enough to bridge the AI language divide, or do underlying issues, such as corporate priorities that favor dominant languages, demand a more systemic overhaul? And is the push for African language AI a step toward equity, or could it inadvertently create new divides within global tech?

  • Marivate is chair of data science, professor of computer science, and director of AfriDSAI at the University of Pretoria; Adebara is assistant professor at the University of Alberta; Wanzare is a lecturer and chair of the Department of Computer Science at Maseno University