Auto-generated and lightly edited — let us know about errors.
Adaobi: When most people think 'African languages and LLMs', they think low-resource. Akua, is that the right frame?
Dr Sarpong: I think the framing is wrong but the conclusion is sometimes right. African languages are not low-resource in the absolute sense. Yoruba has roughly forty-five million speakers. Twi has about twenty million. There is more spoken Twi in the world than there is spoken Swedish. The data exists. The problem is that the data is not digitized, and what is digitized is in a few narrow registers: Bible translations, news, parliamentary records. The vocabulary of a customer service conversation in Kumasi shares maybe forty percent of its tokens with the Bible translation. So when we say low-resource, what we really mean is low-resource in the digital register that matters for the application we want to build.
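Editor's note: the "forty percent of its tokens" figure is a register-overlap measurement. One way such a number might be computed is sketched below. This is a minimal illustration, not the method Dr Sarpong's team used; the function names `tokenize` and `token_overlap` are hypothetical, the tokenizer is a crude letter-run splitter, and the toy strings stand in for real corpora.

```python
import re
from collections import Counter


def tokenize(text):
    # Crude stand-in for a real tokenizer: lowercase, then take runs of
    # Unicode letters (digits, punctuation and underscores are dropped).
    return re.findall(r"[^\W\d_]+", text.lower())


def token_overlap(target_text, reference_text):
    """Fraction of token occurrences in target_text whose word type
    also appears anywhere in reference_text."""
    target_counts = Counter(tokenize(target_text))
    reference_vocab = set(tokenize(reference_text))
    total = sum(target_counts.values())
    if total == 0:
        return 0.0
    shared = sum(n for tok, n in target_counts.items() if tok in reference_vocab)
    return shared / total


# Toy placeholders: "a b c d e" plays the role of the customer-service
# transcript, "a b x y" the role of the digitized reference corpus.
print(token_overlap("a b c d e", "a b x y"))
```

A low score on this kind of measure for the register you care about is what "low-resource for the application" would look like in practice, even when the language as a whole has plenty of text.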
Kwame: That is an important distinction. Walk us through tokenization for Twi.
Dr Sarpong: Twi is tonal and agglutinative. A single word can encode subject, tense, aspect, mood, and object. Subword tokenization, which is what most large models use, can do reasonable things with European languages because the morphology is mostly suffixal and not too productive. With Twi, you get a word like 'merekoekyere', which decomposes into me-re-ko-e-kyere, meaning 'I am going to show you'. A naive BPE tokenizer splits that into something like 'mere' 'koe' 'kyere' and the model learns garbage. The fix is either to train a custom tokenizer on a Twi corpus from scratch, or to use a morphological pre-processor that gives the model a sequence of clean morphemes. We have done both. The custom tokenizer is more elegant. The morphological pre-processor is faster to ship and easier to debug.
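Editor's note: a morphological pre-processor of the kind described could, in its simplest form, segment each word against a morpheme inventory before the text reaches the model's tokenizer. The sketch below is an assumption-laden toy, not the team's implementation: the inventory `MORPHEMES` contains only the five pieces from the me-re-ko-e-kyere example (spelled as in this transcript), and `segment` and `preprocess` are hypothetical names. A real system would need a proper Twi lexicon, tone and orthography handling, and disambiguation rules.

```python
from functools import lru_cache

# Hypothetical toy inventory: only the morphemes from the example above.
MORPHEMES = frozenset({"me", "re", "ko", "e", "kyere"})


def segment(word, morphemes=MORPHEMES):
    """Return a list of morphemes that exactly covers word,
    or None if the inventory cannot cover it."""
    max_len = max(len(m) for m in morphemes)

    @lru_cache(maxsize=None)
    def solve(start):
        if start == len(word):
            return ()
        # Try longer candidates first, backtracking if the remainder
        # of the word cannot be segmented.
        for end in range(min(len(word), start + max_len), start, -1):
            piece = word[start:end]
            if piece in morphemes:
                rest = solve(end)
                if rest is not None:
                    return (piece,) + rest
        return None

    result = solve(0)
    return list(result) if result is not None else None


def preprocess(sentence):
    # Words the inventory cannot cover are passed through unchanged,
    # so the downstream tokenizer still sees them.
    pieces = []
    for word in sentence.lower().split():
        pieces.extend(segment(word) or [word])
    return " ".join(pieces)


print(segment("merekoekyere"))
```

The appeal of this approach for debugging is that every split can be traced to an inventory entry, which is harder to do with merge rules learned by a BPE tokenizer.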
Adaobi: When is fine-tuning worth it?
Dr Sarpong: My rough rule is: fine-tune when your task is narrow, your data is rich, and the latency budget is tight. For a customer support classifier on a fintech dataset with twenty thousand labeled examples, fine-tuning a small open model beats prompting a frontier model on quality and latency, and it costs less per request. For a general-purpose assistant where the user might ask anything, fine-tuning is the wrong move. You spend three months making the model worse at everything except your narrow task. Just prompt the frontier model and accept the latency.
Kwame: Talk to us about hallucinations.
Dr Sarpong: Hallucinations in agglutinative languages have a particular flavor. The model can produce a grammatically valid Twi word that does not exist. Because the morphology is so productive, the model invents words by gluing morphemes together in patterns it has seen. The output is fluent. The output is confident. The output is fictional. We caught a deployed system telling a user that a transaction was 'apieyem' — which is a word the model invented by combining a-pie-yem in a pattern that exists but applied to a root that does not. A native speaker recognized it as nonsense in a second. The fluency masked it from the English-speaking reviewers. So you have to put native speakers in your evaluation loop. There is no shortcut.
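Editor's note: native-speaker review cannot be automated away, but a simple filter can decide which outputs to route to reviewers first. The sketch below flags words that are missing from a reference word list. Everything here is illustrative: `flag_unknown_words` is a hypothetical helper, the lexicon is a toy set, and the English carrier sentence is a placeholder around the invented word from the anecdote. Because the morphology is productive, a word-form lexicon will also flag many legitimate words, so the result is a review queue, not a verdict.

```python
import re


def flag_unknown_words(output_text, lexicon):
    """Return words from output_text that are not in lexicon,
    deduplicated, in order of first appearance."""
    seen = set()
    flagged = []
    for word in re.findall(r"[^\W\d_]+", output_text.lower()):
        if word not in lexicon and word not in seen:
            seen.add(word)
            flagged.append(word)
    return flagged


# Toy lexicon and placeholder sentence; a real deployment would use a
# vetted Twi word list maintained by native speakers.
lexicon = {"your", "transaction", "was"}
print(flag_unknown_words("Your transaction was apieyem.", lexicon))
```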
Adaobi: Last question. What is your view on open versus closed?
Dr Sarpong: For African languages, open everything. Open data, open models, open critique. The closed frontier labs are not going to invest in our languages at the depth required. They will skim. They will release benchmarks. They will declare victory. The real work — the work of building tokenizers, evaluation sets, instruction data, and feedback loops — has to be done in the open by people who speak the languages. The good news is that this is happening. Masakhane, GhanaNLP, the Akan NLP Toolkit. The community is real. The output is accelerating. I think in three years the gap between English and the top ten African languages will be smaller than the gap between English and most European languages today.
Kwame: Akua, thank you. This has been really clarifying.
Dr Sarpong: My pleasure.