Role
Independent builder
Language data infrastructure / 2026 / Independent builder
Bilingual Language Data Platform
Crowdsourcing platform for collecting bilingual English-Shona speech and text data. Built to support language tooling, voice interfaces, and low-resource NLP work in Zimbabwe.
Focus
Language data infrastructure
Stack
TypeScript / Next.js / Node.js / PostgreSQL / TailwindCSS / Azure
Delivery
1 public endpoint
Exposed through public-facing project links.
Surface
Speech + text collection
A contribution workflow designed to gather bilingual training data.
Constraint
Low-resource language support
Targets a missing data layer that blocks local NLP and voice systems.
A crowdsourcing platform for building a bilingual English-Shona voice assistant through community data collection
Shona is spoken by over 10 million people in Zimbabwe, yet it is barely represented in modern language technology. No Siri, no Alexa, no mainstream voice assistant. The reason is simple: there is almost no openly available training data. Speech corpora, parallel text datasets, and emotion-labeled audio are scarce or missing. Without data, there are no models. Without models, there's no voice for an entire language in the digital world.
Inzwi (meaning "voice" in Shona) is a platform that tackles this from the ground up. Instead of waiting for big tech to notice low-resource African languages, it crowdsources the data directly from native speakers through a gamified web platform, then uses that data to fine-tune speech and language models.
The platform collects three types of data:
inzwi/
├── frontend/          # Next.js 14 PWA
├── backend/           # Fastify API server
│   ├── routes/        # Auth, recordings, sentences, validations,
│   │                  # freeform, WhatsApp, admin, profile
│   ├── services/      # Business logic
│   └── prisma/        # Database schema & migrations
├── shared/            # Shared TypeScript types
└── docs/              # Research proposal, data architecture
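The layout above lists recordings, sentences, and validations as backend routes, with types shared between frontend and backend. As a rough illustration only, here is a minimal sketch of how such shared types and a validation rule might look. Every name here (`Sentence`, `Recording`, `Validation`, `reviewStatus`, the `minVotes` threshold, the majority-vote rule) is a hypothetical assumption for illustration and is not taken from the Inzwi codebase.

```typescript
// Hypothetical shared types; none of these names come from the actual project.
type Language = "en" | "sn"; // ISO 639-1 codes for English and Shona

interface Sentence {
  id: string;
  language: Language;
  text: string;
}

interface Recording {
  id: string;
  sentenceId: string;    // the prompt sentence that was read aloud
  contributorId: string; // the speaker who made the recording
  audioUrl: string;
}

interface Validation {
  recordingId: string;
  validatorId: string;
  isCorrect: boolean; // does the audio match the prompt sentence?
}

type ReviewStatus = "pending" | "accepted" | "rejected";

// Assumed rule: simple majority vote once at least `minVotes` distinct
// community members (excluding the speaker) have reviewed the recording.
function reviewStatus(
  recording: Recording,
  validations: Validation[],
  minVotes = 3,
): ReviewStatus {
  const seen = new Set<string>();
  let up = 0;
  let down = 0;
  for (const v of validations) {
    if (v.recordingId !== recording.id) continue;            // vote for another recording
    if (v.validatorId === recording.contributorId) continue; // ignore self-validation
    if (seen.has(v.validatorId)) continue;                   // one vote per validator
    seen.add(v.validatorId);
    if (v.isCorrect) up++;
    else down++;
  }
  if (up + down < minVotes || up === down) return "pending";
  return up > down ? "accepted" : "rejected";
}
```

Keeping a rule like this as a pure function over shared types would let the same check run in the API server and in the web client, though the real platform may organize this differently.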
Inzwi is my final-year research project at MSU (2026). The goal isn't just a platform; it's a replicable framework for any low-resource language. The crowdsourcing methodology, data pipeline, and model fine-tuning strategy are designed to be adapted for other African languages facing the same data scarcity.
"Inzwi" means "voice." This project is about giving my language a voice in the digital world. Shona speakers deserve the same access to language technology that English speakers take for granted. Rather than waiting for someone else to build it, I'm building the data infrastructure and models myself, with help from the community.