Every quarter, a new client asks us the same question: can we build a Twi-speaking customer support agent? Or a Yoruba document classifier? Or a Swahili medical triage assistant? The answer is almost always yes, but the path to yes is far less obvious than vendor demos would suggest. Off-the-shelf models, even the largest frontier systems, range from passable to embarrassing on African languages. The gap between English performance and Yoruba performance on the same task is rarely smaller than fifteen points on any benchmark we trust.
The data problem is real, but not what you think
The conventional wisdom is that low-resource languages lack data. That is partially true. The bigger problem is that the data that does exist is mostly the wrong kind. Bible translations, parliamentary proceedings, and news scrapes dominate the public corpora. None of that prepares a model for a customer asking about a failed mobile money transfer using a code-switched mix of Twi, English, and Pidgin in a single sentence. Real African conversation lives in WhatsApp threads, USSD logs, and call center transcripts. Almost none of it is in any pretraining corpus.
Our practice now is to start every engagement with a two-week data audit. We collect the client's actual conversational data, anonymize it, and use it to build an evaluation set before we write a single line of model code. The eval set is more valuable than the model. It outlives every model migration.
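To make the audit step concrete, here is a minimal sketch of turning labeled conversation turns into evaluation cases. Everything in it is an assumption for illustration: the `RawTurn` and `EvalCase` shapes, the `redact` and `buildEvalSet` names, and the digit-run rule, which only masks phone-number-like strings. Real anonymization also has to handle names, account identifiers, and locations, so treat this as a starting point rather than a complete pipeline.

```typescript
// Hypothetical shapes: a labeled turn from client logs and the eval case built from it.
interface RawTurn {
  text: string;
  intent: string;
}

interface EvalCase {
  input: string;
  expectedIntent: string;
}

// Matches runs of seven or more digits, optionally separated by spaces or hyphens,
// with an optional leading "+". This is an illustrative rule, not a full PII filter.
const DIGIT_RUN = /\+?\d(?:[\s-]?\d){6,}/g;

function redact(text: string): string {
  return text.replace(DIGIT_RUN, "<NUMBER>");
}

// Redact each turn, drop empty and duplicate inputs, and keep the human label.
function buildEvalSet(turns: RawTurn[]): EvalCase[] {
  const seen = new Set<string>();
  const cases: EvalCase[] = [];
  for (const turn of turns) {
    const input = redact(turn.text).trim();
    if (input.length === 0 || seen.has(input)) continue;
    seen.add(input);
    cases.push({ input, expectedIntent: turn.intent });
  }
  return cases;
}
```

Deduplicating after redaction, rather than before, keeps near-identical complaints that differ only in the phone number from inflating the set.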
Fine-tuning versus retrieval versus prompting
For most enterprise use cases, the right answer is not what the AI Twitter discourse suggests. We have shipped production systems with all three approaches, and the pattern is clear. Prompting alone works for English-heavy tasks, even in African contexts. Retrieval-augmented generation works when the knowledge is structured and the language is a thin wrapper. Fine-tuning is worth the operational overhead only when the linguistic gap is wide and the volume justifies the cost.
// Our default evaluation harness for a new language deployment.
// Model, EvalCase, and runEval are defined elsewhere in the harness.
const rubric = {
  intent_accuracy: 0.85, // Did we understand the user?
  faithfulness: 0.90, // Did we avoid hallucinating?
  code_switch_tolerance: 0.80, // Did mixed-language inputs survive?
  toxicity_rate: 0.001, // Did we stay safe?
  latency_p95_ms: 1200,
};

// Toxicity and latency are ceilings: lower is better, so they pass at or below target.
const ceilings = new Set(["toxicity_rate", "latency_p95_ms"]);

async function gate(model: Model, evalSet: EvalCase[]) {
  const scores = await runEval(model, evalSet);
  return Object.entries(rubric).every(([k, target]) =>
    ceilings.has(k) ? scores[k] <= target : scores[k] >= target
  );
}

“The hardest part of African-language AI is not the model. It is admitting that the eval you inherited from English benchmarks does not measure anything that matters here.”
Deployment realities
- Latency budgets in Lagos look different than latency budgets in Frankfurt. Plan for regional inference.
- USSD remains the dominant channel for many users. Your LLM output must compress into 160-character chunks gracefully.
- Voice is closing the gap fast. Twi and Yoruba speech-to-text are within a year of being production-viable for support workflows.
- Always ship with a human handoff path. The cost of a wrong answer in a fintech context is too high for full automation.
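
The USSD constraint above is the easiest of these to sketch in code. The following is one possible approach, not a gateway-specific implementation: `chunkForUssd` is a hypothetical helper that splits a reply on word boundaries so that no page exceeds the limit. It counts length with JavaScript's `string.length` (UTF-16 code units), which is an assumption; a real gateway may count differently, particularly for characters outside its default alphabet, so the limit is left as a parameter.

```typescript
// Hypothetical helper: split model output into pages of at most `limit` characters,
// breaking on whitespace where possible. Length is measured in UTF-16 code units.
function chunkForUssd(text: string, limit = 160): string[] {
  if (limit < 1) throw new RangeError("limit must be at least 1");
  const chunks: string[] = [];
  let current = "";
  for (const word of text.split(/\s+/).filter((w) => w.length > 0)) {
    let rest = word;
    // A single token longer than the limit has to be hard-split.
    while (rest.length > limit) {
      if (current) {
        chunks.push(current);
        current = "";
      }
      chunks.push(rest.slice(0, limit));
      rest = rest.slice(limit);
    }
    const candidate = current ? `${current} ${rest}` : rest;
    if (candidate.length <= limit) {
      current = candidate;
    } else {
      // The word does not fit on the current page, so start a new one.
      chunks.push(current);
      current = rest;
    }
  }
  if (current) chunks.push(current);
  return chunks;
}

// chunkForUssd("aaa bbb ccc", 7) returns ["aaa bbb", "ccc"]
```

Splitting on word boundaries is only the mechanical half of "gracefully". In practice it also helps to prompt the model for short replies in the first place, so that most answers fit on a single page.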
We are cautiously optimistic. The frontier is moving in our favor, and the data partnerships emerging across the continent will accelerate the next eighteen months. If you are evaluating a vendor or planning your own build, demand to see their evaluation set. If they cannot show you one, they do not have a product. They have a demo.
