
Inzwi

Bilingual Voice Assistant Platform

Year

2026

Role

Solo Developer

Crowdsourcing platform for building a bilingual English-Shona voice assistant. Community-driven data collection with gamification, audio recording, translation, and WhatsApp integration.

Inzwi

A crowdsourcing platform for building a bilingual English-Shona voice assistant through community data collection

The Problem

Shona is spoken by over 10 million people in Zimbabwe, yet it has almost no representation in modern language technology: no Siri, no Alexa, and only rudimentary machine translation. The reason is simple: there's no training data. No speech corpora, no parallel text datasets, no emotion-labeled audio. Without data, there are no models. Without models, there's no voice for an entire language in the digital world.

The Solution

Inzwi (meaning "voice" in Shona) is a platform that tackles this from the ground up. Instead of waiting for big tech to notice low-resource African languages, it crowdsources the data directly from native speakers through a gamified web platform β€” then uses that data to finetune speech and language models.

The platform collects three types of data:

  1. Text translations β€” English-Shona sentence pairs for bilingual model training
  2. Audio recordings β€” Read speech and emotional speech for STT and TTS training
  3. Validations β€” Community peer review to ensure data quality

How It Works

  1. Contribute β€” Users translate sentences, record audio, or validate others' contributions
  2. Gamification β€” Points, badges, leaderboards, and progress tracking keep contributors engaged
  3. Quality Pipeline β€” Multi-stage validation (community voting + automated audio checks) ensures clean data
  4. Model Training β€” Collected data is used to finetune Whisper (STT), LLaMA (bilingual LLM), XTTS (TTS), and emotion recognition models
  5. WhatsApp Bot β€” Meets users where they are, enabling contribution and interaction via WhatsApp
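The automated audio checks in step 3 could be sketched as a pure function that rejects clips that are too quiet or clipped. This is purely illustrative: the function name, thresholds, and result shape here are assumptions, not the platform's actual implementation.

```typescript
// Illustrative automated audio quality gate: reject recordings that
// are too quiet (low RMS energy) or heavily clipped. Thresholds are
// placeholder values, not the platform's real tuning.
interface AudioCheckResult {
  ok: boolean;
  reason?: "too_quiet" | "clipping";
}

function checkAudioQuality(
  samples: Float32Array,   // PCM samples in [-1, 1]
  minRms = 0.01,           // minimum acceptable RMS energy
  maxClipRatio = 0.01      // max fraction of near-full-scale samples
): AudioCheckResult {
  let sumSquares = 0;
  let clipped = 0;
  for (const s of samples) {
    sumSquares += s * s;
    if (Math.abs(s) >= 0.99) clipped++;
  }
  const rms = Math.sqrt(sumSquares / samples.length);
  if (rms < minRms) return { ok: false, reason: "too_quiet" };
  if (clipped / samples.length > maxClipRatio) {
    return { ok: false, reason: "clipping" };
  }
  return { ok: true };
}
```

In the browser, the same check can run client-side on the decoded buffer before upload, so contributors get instant feedback instead of a rejection after the fact.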

Key Features

  • Audio Recording Module β€” Web-based recording with client-side quality checks (volume, noise detection)
  • Translation Module β€” Community-driven English ↔ Shona text translation with dialect tagging
  • Validation System β€” Peer validation for audio accuracy and translation quality
  • Emotional Speech Collection β€” Prompted scenarios for 5 emotions (neutral, happy, sad, angry, surprised)
  • Gamification β€” Leaderboards, badges, achievements, and contributor stats
  • WhatsApp Integration β€” Bot for audio collection and freeform conversation via Meta Cloud API
  • Dialect Support β€” Tags for Zezuru, Karanga, Manyika, Ndau, and Korekore dialects
  • PWA β€” Offline-capable progressive web app for low-bandwidth environments
  • Freeform Mode β€” Open-ended conversational data collection alongside structured tasks
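The dialect and emotion tags above suggest a shape for the shared contribution types. A minimal sketch, assuming hypothetical names (the real schema in `shared/` may differ):

```typescript
// Hypothetical shared types for contributions. The five dialect tags
// and five emotions mirror the feature list; field names are assumed.
type ShonaDialect = "zezuru" | "karanga" | "manyika" | "ndau" | "korekore";
type Emotion = "neutral" | "happy" | "sad" | "angry" | "surprised";

interface Translation {
  kind: "translation";
  englishText: string;
  shonaText: string;
  dialect?: ShonaDialect; // optional dialect tagging
}

interface Recording {
  kind: "recording";
  sentenceId: string;
  audioUrl: string;
  emotion?: Emotion;      // set for emotional speech prompts
  dialect?: ShonaDialect;
}

type Contribution = Translation | Recording;
```

A discriminated union like this lets the validation UI switch on `kind` and keeps the frontend and Fastify backend agreeing on one contract.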

Technical Stack

  • Frontend β€” Next.js 14, TypeScript, Tailwind CSS, NextAuth.js
  • Backend β€” Node.js, Fastify, TypeScript, Zod validation
  • Database β€” PostgreSQL with Prisma ORM
  • Storage β€” Azure Blob Storage for audio files
  • WhatsApp β€” Meta Cloud API for bot integration
  • ML Pipeline β€” PyTorch, HuggingFace Transformers (Whisper, LLaMA, XTTS, Wav2Vec2)
  • Deployment β€” Vercel (frontend), VPS (backend)

Architecture

inzwi/
β”œβ”€β”€ frontend/     # Next.js 14 PWA
β”œβ”€β”€ backend/      # Fastify API server
β”‚   β”œβ”€β”€ routes/   # Auth, recordings, sentences, validations,
β”‚   β”‚             # freeform, WhatsApp, admin, profile
β”‚   β”œβ”€β”€ services/ # Business logic
β”‚   └── prisma/   # Database schema & migrations
β”œβ”€β”€ shared/       # Shared TypeScript types
└── docs/         # Research proposal, data architecture

The Bigger Picture

Inzwi is my final-year research project at MSU (2026). The goal isn't just a platform β€” it's a replicable framework for any low-resource language. The crowdsourcing methodology, data pipeline, and model finetuning strategy are designed to be adapted for other African languages facing the same data scarcity.

Target Datasets

  • 50,000–100,000 bilingual sentence pairs
  • 50–100 hours of read Shona speech
  • 10–20 hours of emotion-labeled speech
  • The first emotion-labeled Shona speech dataset

Why I Built This

"Inzwi" means "voice." This project is about giving my language a voice in the digital world. Shona speakers deserve the same access to language technology that English speakers take for granted. Rather than waiting for someone else to build it, I'm building the data infrastructure and models myself β€” with help from the community.


Built With

TypeScript · Next.js · Node.js · PostgreSQL · Tailwind CSS · Azure
