Building LLMs that speak Twi and Yoruba (and when it is the wrong move)

Episode 23 • with Dr Akua Sarpong

AI & Data

Built in Africa · Episode

Guest

Dr Akua Sarpong

Researcher at Akan NLP Toolkit

May 21, 2026 · 44:08

About this episode

What it actually takes to build production AI for African languages — and when fine-tuning is the wrong answer.

Chapters

00:00 Why African languages are not low-resource
06:30 Tokenization for Twi: a special case
14:00 When fine-tuning is worth it
23:00 Hallucinations in agglutinative languages
32:00 Open data, open models, open critique

#LLM #African languages #Twi #Yoruba #research

Full conversation

Transcript

Auto-generated and lightly edited — let us know about errors.

Adaobi: When most people think 'African languages and LLMs', they think low-resource. Akua, is that the right frame?

Dr Sarpong: I think the framing is wrong but the conclusion is sometimes right. African languages are not low-resource in the absolute sense. Yoruba has roughly forty-five million speakers. Twi has about twenty million. There is more spoken Twi in the world than there is spoken Swedish. The data exists. The problem is that the data is not digitized, and what is digitized is in a few narrow registers: Bible translations, news, parliamentary records. The vocabulary of a customer service conversation in Kumasi shares maybe forty percent of its tokens with the Bible translation. So when we say low-resource what we really mean is low-resource in the digital register that matters for the application we want to build.

Kwame: That is an important distinction. Walk us through tokenization for Twi.

Dr Sarpong: Twi is tonal and agglutinative. A single word can encode subject, tense, aspect, mood, and object. Subword tokenization, which is what most large models use, can do reasonable things with European languages because the morphology is mostly suffixal and not too productive. With Twi, you get a word like 'merekoekyere' which decomposes into me-re-ko-e-kyere, 'I am going to show you'. A naive BPE tokenizer splits that into something like 'mere' 'koe' 'kyere' and the model learns garbage. The fix is either to train a custom tokenizer on a Twi corpus from scratch, or to use a morphological pre-processor that gives the model a sequence of clean morphemes. We have done both. The custom tokenizer is more elegant. The morphological pre-processor is faster to ship and easier to debug.

Adaobi: When is fine-tuning worth it?

Dr Sarpong: My rough rule is: fine-tune when your task is narrow, your data is rich, and the latency budget is tight. For a customer support classifier on a fintech dataset with twenty thousand labeled examples, fine-tuning a small open model beats prompting a frontier model on quality and latency, and it costs less per request. For a general-purpose assistant where the user might ask anything, fine-tuning is the wrong move. You spend three months making the model worse at everything except your narrow task. Just prompt the frontier model and accept the latency.

Kwame: Talk to us about hallucinations.

Dr Sarpong: Hallucinations in agglutinative languages have a particular flavor. The model can produce a grammatically valid Twi word that does not exist. Because the morphology is so productive, the model invents words by gluing morphemes together in patterns it has seen. The output is fluent. The output is confident. The output is fictional. We caught a deployed system telling a user that a transaction was 'apieyem', which is a word the model invented by combining a-pie-yem in a pattern that exists but applied to a root that does not. A native speaker recognized it as nonsense in a second. The fluency masked it from the English-speaking reviewers. So you have to put native speakers in your evaluation loop. There is no shortcut.

Adaobi: Last question. What is your view on open versus closed?

Dr Sarpong: For African languages, open everything. Open data, open models, open critique. The closed frontier labs are not going to invest in our languages at the depth required. They will skim. They will release benchmarks. They will declare victory. The real work, the work of building tokenizers, evaluation sets, instruction data, and feedback loops, has to be done in the open by people who speak the languages. The good news is that this is happening. Masakhane, GhanaNLP, the Akan NLP Toolkit. The community is real. The output is accelerating. I think in three years the gap between English and the top ten African languages will be smaller than the gap between English and most European languages today.

Kwame: Akua, thank you. This has been really clarifying.

Dr Sarpong: My pleasure.
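
The morphological pre-processing step Dr Sarpong describes in the tokenization segment can be sketched roughly as below. This is a minimal illustration, not the Akan NLP Toolkit's implementation: the morpheme inventory contains only the pieces of the one example word from the conversation, and the names `MORPHEMES`, `segment`, and `preprocess` are hypothetical. It assumes a plain dictionary lookup with a fewest-pieces preference, which a real Twi analyzer would replace with proper morphological rules.

```python
from typing import Optional

# Hypothetical inventory: only the morphemes of the transcript's example word.
MORPHEMES = frozenset({"me", "re", "ko", "e", "kyere"})


def segment(word: str, inventory: frozenset = MORPHEMES) -> Optional[list]:
    """Split `word` into known morphemes, or return None if no full split exists."""
    # best[i] holds the fewest-morpheme split of word[:i], or None if unreachable.
    best = [None] * (len(word) + 1)
    best[0] = []
    for end in range(1, len(word) + 1):
        for start in range(end):
            prefix = best[start]
            piece = word[start:end]
            if prefix is None or piece not in inventory:
                continue
            candidate = prefix + [piece]
            if best[end] is None or len(candidate) < len(best[end]):
                best[end] = candidate
    return best[len(word)]


def preprocess(text: str) -> str:
    """Rewrite each word as hyphen-joined morphemes; leave unsegmentable words whole."""
    out = []
    for word in text.lower().split():
        pieces = segment(word)
        out.append("-".join(pieces) if pieces else word)
    return " ".join(out)


print(preprocess("merekoekyere"))  # me-re-ko-e-kyere
```

The output of a step like this would be passed to the model's tokenizer in place of the raw text, so that subword splits fall on morpheme boundaries rather than on arbitrary character runs. Words the inventory cannot fully cover are passed through unchanged, which keeps the step easy to debug.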

Keep listening

Related episodes

All episodes

Ep 24 · Fintech

Scaling mobile money rails to four countries in nine months

with Adwoa Mensah

Adwoa walks us through how Pan-African Payments built one platform for four countries — and the regulatory, engineering, and operational decisions that made it possible.

June 4, 2026 · 52:14

Ep 22 · Public Sector

Zero-trust in Ghana's public sector

with Nana Yaw Boateng

How Ghana's digital services team moved from network-perimeter security to a zero-trust posture across thirty agencies in eighteen months.

May 7, 2026 · 47:32

Ep 21 · Engineering practice

Why we open-sourced our fintech reconciliation engine

with Tobi Ade

Tobi explains the business and engineering case for open-sourcing a piece of core financial infrastructure — and what happened next.

April 23, 2026 · 41:18

Built in Africa

Listen to more 'Built in Africa' episodes

Bi-weekly conversations with the people building serious software from and for Africa. Twenty-four episodes and counting.
