When a Tier-1 telco in Ghana asked us to replace their legacy mobile money switch, the brief sounded simple on paper: keep the API surface stable, reduce per-transaction latency below 120ms at the 99th percentile, and survive a market that doubles in volume every eighteen months. The reality was anything but simple. The legacy stack had grown organically over a decade, accumulating bespoke fee tables, regulator-mandated audit trails, and a quiet web of dependencies on a single Oracle RAC cluster that was running out of room on its raised floor.
The throughput problem
Our first instinct was to measure before we touched anything. We instrumented a shadow replica with OpenTelemetry, replayed two weeks of production traffic, and watched the heat-map light up. The bottleneck was not, as the original team suspected, the database. It was the synchronous fan-out to four downstream systems: KYC, fraud, ledger, and notification. Every transaction blocked on all four. When any one of them paused for GC or a network hiccup, the entire switch slowed down.
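A back-of-the-envelope model shows why blocking on every dependency hurts the tail. The sketch below is illustrative rather than a measurement from the engagement: it assumes each downstream call is independently slow with the same probability, and the function names and the 1% figure are made up for the example.

```typescript
// Probability that a request blocked on `n` dependencies hits at least one
// slow call, assuming each call is independently slow with probability `p`.
// The independence assumption and the numbers below are illustrative only.
function probAnySlow(p: number, n: number): number {
  if (p < 0 || p > 1) throw new RangeError("p must be in [0, 1]");
  if (!Number.isInteger(n) || n < 0) {
    throw new RangeError("n must be a non-negative integer");
  }
  return 1 - Math.pow(1 - p, n);
}

// A synchronous fan-out completes only when its slowest call completes.
function fanOutLatencyMs(latenciesMs: number[]): number {
  return latenciesMs.length === 0 ? 0 : Math.max(...latenciesMs);
}

// Four dependencies, each slow 1% of the time: about 3.9% of requests are slow.
console.log(probAnySlow(0.01, 4).toFixed(4)); // prints 0.0394

// One 400 ms pause among four calls dominates the whole request.
console.log(fanOutLatencyMs([12, 18, 400, 9])); // prints 400
```

Under these assumptions, a dependency that is slow once in a hundred calls makes the switch slow roughly four times in a hundred, which is why the tail moves long before the median does.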
We redesigned around an event log. Kafka became the spine, with three topics serving the hot path: transaction-requested, transaction-authorized, and transaction-settled. The switch itself shrank to a thin authorization service that wrote to the log and returned a synchronous decision in under 40ms. Everything downstream became a subscriber that could fall behind without blocking the customer.
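The decoupling described above can be modeled with a toy in-memory log. This is a minimal sketch and deliberately not the Kafka client API: the `EventLog` and `Subscriber` classes are hypothetical, and they only illustrate the property the design relies on, namely that each consumer tracks its own offset so a slow one never holds up the writer.

```typescript
// Toy append-only log. NOT the Kafka API; it only models per-consumer offsets.
class EventLog<T> {
  private readonly entries: T[] = [];

  // Appending never waits on any consumer. Returns the new entry's offset.
  append(event: T): number {
    this.entries.push(event);
    return this.entries.length - 1;
  }

  // Read up to `max` entries starting at `offset`.
  read(offset: number, max: number): T[] {
    return this.entries.slice(offset, offset + max);
  }

  get endOffset(): number {
    return this.entries.length;
  }
}

class Subscriber<T> {
  private readonly log: EventLog<T>;
  private offset = 0;
  readonly seen: T[] = [];

  constructor(log: EventLog<T>) {
    this.log = log;
  }

  // Process at most `batch` events, then remember where we stopped.
  poll(batch: number): void {
    const events = this.log.read(this.offset, batch);
    this.seen.push(...events);
    this.offset += events.length;
  }

  // Number of events this subscriber has not processed yet.
  get lag(): number {
    return this.log.endOffset - this.offset;
  }
}

const log = new EventLog<string>();
const ledgerSub = new Subscriber(log);       // keeps up with the log
const notificationSub = new Subscriber(log); // hypothetical slow consumer

for (let i = 0; i < 5; i++) log.append(`tx-${i}`);
ledgerSub.poll(10);
notificationSub.poll(2);

console.log(ledgerSub.lag, notificationSub.lag); // prints 0 3
```

The slow subscriber accumulates lag, but all five appends complete regardless. Lag becomes something to monitor and alert on instead of something the customer waits for.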
    // Authorization hot path: keep this lean.
    async function authorize(req: TxRequest): Promise<TxDecision> {
      const account = await ledger.lockBalance(req.payerId);
      if (account.available < req.amount) return reject("INSUFFICIENT_FUNDS");

      // Inline risk check with a 25 ms budget.
      const decision = await risk.scoreInline(req, { timeoutMs: 25 });
      if (decision.deny) return reject(decision.reason);

      // Record the approval in the log; downstream subscribers pick it up from there.
      await kafka.publish("transaction-authorized", req.id, req);
      return approve(req.id);
    }

Lessons that travel
If you are building payment rails on the continent, three lessons from this engagement apply almost universally. First, the regulator is a stakeholder, not a constraint. We invited the central bank's tech team to two of our architecture reviews. They flagged a reporting gap that would have cost us a six-week remediation if discovered post-launch. Second, the agent network is your real-world load test. Telco agents have their own peak-hour patterns, and they cluster geographically. Your read replicas should follow your agents, not your office. Third, idempotency is not optional. Every retry, every double-tap, every dropped USSD session translates to a duplicate request somewhere in your stack.
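The idempotency point can be sketched in a few lines. This is a minimal illustration, assuming the client sends a stable transaction ID with every retry. The `IdempotencyGuard` class and the `authorizeOnce` function are hypothetical, and the in-memory `Map` stands in for whatever shared, expiring store a real deployment would need.

```typescript
// Minimal idempotency guard. The in-memory Map is a stand-in for a shared
// store; a production version would also need expiry and concurrency control.
class IdempotencyGuard<R> {
  private readonly results = new Map<string, R>();

  // Run `handler` at most once per key; replays return the stored result.
  execute(key: string, handler: () => R): R {
    if (this.results.has(key)) {
      return this.results.get(key) as R;
    }
    const result = handler();
    this.results.set(key, result);
    return result;
  }
}

type Decision = { txId: string; approved: boolean };

let debits = 0; // counts how many times money would actually move
const guard = new IdempotencyGuard<Decision>();

function authorizeOnce(txId: string): Decision {
  return guard.execute(txId, () => {
    debits += 1;
    return { txId, approved: true };
  });
}

authorizeOnce("tx-42"); // original request
authorizeOnce("tx-42"); // retry after a dropped session
authorizeOnce("tx-43"); // a genuinely new transaction

console.log(debits); // prints 2
```

The retry receives the same decision as the original request, and the payer is debited once per transaction ID, not once per attempt.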
“We stopped thinking about transactions as RPC calls and started thinking about them as facts in a log. That single mental shift unlocked the throughput we needed.”
What we would do differently
- Invest in synthetic load earlier — we waited until week ten and lost a sprint catching up.
- Treat the reconciliation report as a first-class product, not a back-office artifact.
- Standardize on a single tracing context across mobile, web, and USSD entry points from day one.
- Budget for the regulator-facing audit log as its own service with its own SLA.
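On the tracing point, one widely used option for a shared context is the W3C Trace Context `traceparent` header. The sketch below is illustrative only and does not describe the format used on this project: `parseTraceParent` and `traceIdFor` are hypothetical helpers, and how a USSD gateway would carry the value is left open.

```typescript
// Parse and validate a W3C Trace Context `traceparent` value (version 00).
// Format: "00" - trace-id (32 hex) - parent-id (16 hex) - flags (2 hex).
interface TraceParent {
  traceId: string;
  parentId: string;
  sampled: boolean;
}

const TRACEPARENT_RE = /^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$/;

function parseTraceParent(header: string): TraceParent | null {
  const match = TRACEPARENT_RE.exec(header.trim());
  if (match === null) return null;
  const traceId = match[1] as string;
  const parentId = match[2] as string;
  const flags = match[3] as string;
  // All-zero IDs are invalid per the specification.
  if (/^0+$/.test(traceId) || /^0+$/.test(parentId)) return null;
  return { traceId, parentId, sampled: (parseInt(flags, 16) & 0x01) === 0x01 };
}

// Hypothetical helper: every entry point reuses a valid incoming trace ID and
// starts a new trace otherwise, so one transaction maps to one trace.
function traceIdFor(header: string | undefined, newTraceId: () => string): string {
  const parsed = header === undefined ? null : parseTraceParent(header);
  return parsed === null ? newTraceId() : parsed.traceId;
}

const incoming = "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01";
console.log(parseTraceParent(incoming)?.traceId); // prints 4bf92f3577b34da6a3ce929d0e0e4736
console.log(traceIdFor("not-a-header", () => "fresh-trace")); // prints fresh-trace
```

Whatever format is chosen, the value lies in every entry point applying the same rule, so a single transaction can be followed end to end.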
The switch went live on a Tuesday at 02:00 GMT. By 09:00 we had cleared four million authorized transactions with a P99 of 87ms. By Friday the engineering team was already planning the next milestone: cross-border interoperability with two neighboring markets. The platform was no longer the bottleneck. The market was.
