TWB Voice 1.0 - Hausa | Mozilla Data Collective

Description

TWB Voice 1.0 - Hausa is the Hausa language portion of the TWB Voice 1.0 multilingual speech corpus, created by CLEAR Global (formerly Translators without Borders). It contains approximately 58 hours of read speech recorded by native Hausa speakers through the TWB Voice platform. The dataset includes 36,665 recordings across train, dev, test, rejected, and pending splits, with transcriptions and speaker demographic metadata (age, gender, education level, country of origin). Audio is in WAV format at a 48 kHz sample rate. The dataset was created to support automatic speech recognition (ASR) development for underrepresented languages, with funding from the Patrick J. McGovern Foundation.

Specifics

Licensing

Creative Commons Attribution-NonCommercial 4.0 International (CC-BY-NC-4.0)

Data was collected through the TWB Voice platform, coordinated by CLEAR Global. Native Hausa speakers read prompted text, and the recordings were then reviewed by native speakers. This is the Hausa subset of the full TWB Voice 1.0 dataset on Hugging Face.

Data Splits

Split	Recordings	Hours
train	26,996	43.79
dev	2,540	4.03
test	4,591	6.56
rejected	1,265	2.25
pending	1,273	1.48
Total	36,665	58.11

The train, dev, and test splits share no speakers and target an 80/10/10 split by duration; the published hours work out to roughly 80.5/7.4/12.1. Rejected and pending recordings are kept in separate splits.
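The speaker-disjointness and duration claims can be checked against the metadata. The sketch below is illustrative only: it assumes each metadata row is available as a dict with the `user_id`, `split`, and `duration` fields listed under Data Fields, and that `duration` parses as a number (its unit does not matter for ratios). The function name `check_splits` is hypothetical.

```python
from collections import defaultdict

APPROVED_SPLITS = ("train", "dev", "test")


def check_splits(rows):
    """Return (overlapping_speakers, duration_share) for the approved splits.

    `rows` is an iterable of dicts with at least the `user_id`, `split`,
    and `duration` keys (assumed to match the metadata.tsv columns).
    """
    splits_by_speaker = defaultdict(set)
    duration_by_split = defaultdict(float)

    for row in rows:
        split = row["split"]
        if split not in APPROVED_SPLITS:
            continue  # rejected and pending recordings are ignored here
        splits_by_speaker[row["user_id"]].add(split)
        duration_by_split[split] += float(row["duration"])

    # A speaker overlaps if their recordings appear in more than one split.
    overlapping = {
        speaker
        for speaker, splits in splits_by_speaker.items()
        if len(splits) > 1
    }

    total = sum(duration_by_split.values())
    share = (
        {s: duration_by_split[s] / total for s in APPROVED_SPLITS}
        if total
        else {}
    )
    return overlapping, share
```

An empty `overlapping` set is what the no-overlap statement predicts; `share` gives each split's fraction of the approved duration.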

Gender Distribution (approved splits)

Gender	Hours
Male	38.82
Female	15.56
Total	54.38
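As a quick consistency check, the gender total should equal the combined hours of the approved (train, dev, test) splits. The snippet below only re-does that arithmetic with the figures copied from the two tables; the variable names are illustrative.

```python
# Hours copied from the Data Splits and Gender Distribution tables.
split_hours = {"train": 43.79, "dev": 4.03, "test": 6.56}
gender_hours = {"Male": 38.82, "Female": 15.56}

approved_total = sum(split_hours.values())
gender_total = sum(gender_hours.values())

# Share of approved hours per gender, as a percentage with one decimal.
gender_share = {
    gender: round(100 * hours / gender_total, 1)
    for gender, hours in gender_hours.items()
}

print(f"approved splits: {approved_total:.2f} h")
print(f"gender total:    {gender_total:.2f} h")
print(gender_share)
```

Both totals come to 54.38 hours, with male speakers contributing about 71.4% and female speakers about 28.6% of the approved audio.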

Data Fields

id, sentence, task_id, sentence_source, user_id, age, gender, duration, locale, variant, education_level, country_of_origin, created_at, path, split
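A minimal sketch of reading these fields with Python's standard `csv` module follows. It assumes `metadata.tsv` has a header row using exactly the field names above and is tab-separated without quoting; neither detail is confirmed here, so check README.md. The helper name `load_metadata` and the sample row are made up for illustration.

```python
import csv
import io

EXPECTED_FIELDS = [
    "id", "sentence", "task_id", "sentence_source", "user_id", "age",
    "gender", "duration", "locale", "variant", "education_level",
    "country_of_origin", "created_at", "path", "split",
]


def load_metadata(fileobj):
    """Parse tab-separated metadata and return a list of row dicts.

    Raises ValueError if an expected column is missing from the header.
    """
    # QUOTE_NONE keeps quotation marks inside sentences as literal text
    # (an assumption about how the file was written).
    reader = csv.DictReader(fileobj, delimiter="\t", quoting=csv.QUOTE_NONE)
    header = reader.fieldnames or []
    missing = [name for name in EXPECTED_FIELDS if name not in header]
    if missing:
        raise ValueError(f"missing columns: {missing}")
    return list(reader)


# Demo on an in-memory file; every value below is invented.
sample_row = {name: "" for name in EXPECTED_FIELDS}
sample_row.update({"id": "1", "sentence": 'Sannu "duniya"', "split": "train"})
sample_text = (
    "\t".join(EXPECTED_FIELDS) + "\n"
    + "\t".join(sample_row[name] for name in EXPECTED_FIELDS) + "\n"
)
rows = load_metadata(io.StringIO(sample_text))

# With the real file (path assumed from the File Structure listing):
# with open("TWB-Voice-1.0-hau/metadata.tsv", newline="", encoding="utf-8") as f:
#     rows = load_metadata(f)
```

All values are returned as strings, so numeric fields such as `duration` need an explicit conversion before use.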

File Structure

TWB-Voice-1.0-hau/
├── clips/
│   ├── train/       (26,996 WAV files)
│   ├── dev/         (2,540 WAV files)
│   ├── test/        (4,591 WAV files)
│   ├── rejected/    (1,265 WAV files)
│   └── pending/     (1,273 WAV files)
├── metadata.tsv
├── README.md
└── LICENSE

Audio files are in WAV format at a 48 kHz sample rate.
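To confirm the sample rate of a downloaded clip, the standard-library `wave` module is enough, assuming the files are uncompressed PCM WAV (the card does not state the encoding). The helper names below are hypothetical, and the example path is a placeholder rather than a real file name.

```python
import wave

EXPECTED_RATE_HZ = 48_000


def describe_wav(path):
    """Return (sample_rate_hz, channels, duration_seconds) for a PCM WAV file."""
    with wave.open(str(path), "rb") as wav:
        rate = wav.getframerate()
        channels = wav.getnchannels()
        seconds = wav.getnframes() / rate
    return rate, channels, seconds


def has_expected_rate(path):
    """True if the file's header reports the 48 kHz rate stated above."""
    return describe_wav(path)[0] == EXPECTED_RATE_HZ


# Example call (placeholder file name):
# print(describe_wav("TWB-Voice-1.0-hau/clips/train/<clip>.wav"))
```

Many ASR toolkits expect 16 kHz input, so the 48 kHz clips may need resampling before training; whether that applies depends on the model being used.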

Fine-tuned Models

Dataset Access Terms

This dataset is subject to Dataset Access Terms in addition to the CC BY-NC 4.0 license. By downloading, users become independent data controllers responsible for complying with applicable data protection laws.

Citation

@dataset{twb_voice_2025,
  title = {TWB Voice 1.0},
  author = {CLEAR Global},
  year = {2025},
  publisher = {Hugging Face},
  url = {https://huggingface.co/datasets/CLEAR-Global/twb-voice-1.0}
}

Disclaimer

This dataset was created by CLEAR Global with support from the Patrick J. McGovern Foundation.

Please check README.md for more information.
