License: CC-BY-NC-4.0
Steward: CLEAR Global
Task: ASR
Release Date: 5/8/2026
Format: WAV, TSV
Size: 11.88 GB
TWB Voice 1.0 - Hausa is the Hausa language portion of the TWB Voice 1.0 multilingual speech corpus, created by CLEAR Global (formerly Translators without Borders). It contains approximately 58 hours of read speech recorded by native Hausa speakers through the TWB Voice platform. The dataset includes 36,665 recordings across train, dev, test, rejected, and pending splits, with transcriptions and speaker demographic metadata (age, gender, education level, country of origin). Audio is in WAV format at 48kHz. The dataset was created to support automatic speech recognition (ASR) development for underrepresented languages, with funding from the Patrick J. McGovern Foundation.
Licensing
Creative Commons Attribution Non Commercial 4.0 International (CC-BY-NC-4.0)
Restrictions/Special Constraints
Non-commercial use only. Commercial use requires explicit permission from CLEAR Global. The dataset is subject to Dataset Access Terms (https://huggingface.co/datasets/CLEAR-Global/TWB-Voice-dataset-access-terms) in addition to the CC BY-NC 4.0 license. By downloading the data, the user becomes an independent data controller and is responsible for complying with applicable data protection laws (including GDPR), responding to data subject rights requests (access, rectification, erasure, portability, objection), and reporting any data breaches to CLEAR Global. Users must implement appropriate technical and organizational security measures (e.g. access controls, encryption). Data must be retained only as long as necessary for the intended purpose. Users must comply with any data deletion requests from CLEAR Global (including deletion from backups and derived datasets). If redistributing, recipients must be bound by the same license and Dataset Access Terms.
Forbidden Usage
No commercial use without explicit permission from CLEAR Global. No redistribution without attribution or without ensuring recipients are bound by the same license and Dataset Access Terms.
Ethical Review
All speakers consented to data collection and open publishing. Speaker identities are anonymized using user IDs. The dataset is gated to support compliance with privacy and data protection laws (e.g. GDPR). By downloading, users accept data controller responsibilities under the Dataset Access Terms (https://huggingface.co/datasets/CLEAR-Global/TWB-Voice-dataset-access-terms), including handling data subject rights requests and data breach notifications.
Intended Use
Training and evaluation of automatic speech recognition (ASR) systems for Hausa, speaker recognition research, language identification, and speech synthesis/TTS development.
Data was collected through the TWB Voice platform, coordinated by CLEAR Global. Native Hausa speakers read prompted text, and the recordings were then reviewed by native speakers. This is the Hausa subset of the full TWB Voice 1.0 dataset on Hugging Face.
| Split | Recordings | Hours |
|---|---|---|
| train | 26,996 | 43.79 |
| dev | 2,540 | 4.03 |
| test | 4,591 | 6.56 |
| rejected | 1,265 | 2.25 |
| pending | 1,273 | 1.48 |
| Total | 36,665 | 58.11 |
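The train, dev, and test splits share no speakers and target an 80/10/10 split by duration; per the table above, the actual shares are about 80.5/7.4/12.1 percent of the 54.38 hours in these three splits. Rejected and pending recordings are preserved in separate splits.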
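As a quick check on the duration ratios, the following sketch computes each split's share of the combined train/dev/test duration from the hours in the table above. The names `SPLIT_HOURS` and `duration_shares` are illustrative and not part of the dataset.

```python
# Hours per split, copied from the split table in this document.
SPLIT_HOURS = {"train": 43.79, "dev": 4.03, "test": 6.56}


def duration_shares(hours):
    """Return each split's share of the total duration, in percent."""
    total = sum(hours.values())
    return {name: round(100 * value / total, 1) for name, value in hours.items()}


shares = duration_shares(SPLIT_HOURS)
print(shares)  # {'train': 80.5, 'dev': 7.4, 'test': 12.1}
```

The rejected and pending splits are left out of the calculation because the 80/10/10 target is stated only for train, dev, and test.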
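Hours by speaker gender (the total equals the combined train, dev, and test duration: 43.79 + 4.03 + 6.56 = 54.38 hours):
| Gender | Hours |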
|---|---|
| Male | 38.82 |
| Female | 15.56 |
| Total | 54.38 |
Metadata fields:
id, sentence, task_id, sentence_source, user_id, age, gender, duration, locale, variant, education_level, country_of_origin, created_at, path, split
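The sketch below shows one way to work with these fields using only the Python standard library. It rests on assumptions that should be checked against the actual files: that metadata.tsv is tab-separated with a header row using the field names listed above, that sentences are stored without CSV-style quoting, and that the `split` column holds the split names used in the table (`train`, `dev`, `test`, and so on). All function names are hypothetical helpers. It counts recordings per split and reports any `user_id` that appears in more than one of the train, dev, and test splits.

```python
import csv
from collections import Counter, defaultdict
from itertools import combinations


def read_metadata(tsv_lines):
    """Parse tab-separated lines (e.g. an open file) with a header row into dicts."""
    # QUOTE_NONE is an assumption: quote characters in sentences are kept literally.
    return list(csv.DictReader(tsv_lines, delimiter="\t", quoting=csv.QUOTE_NONE))


def count_by_split(rows):
    """Count recordings per value of the `split` column."""
    return Counter(row["split"] for row in rows)


def shared_speakers(rows, splits=("train", "dev", "test")):
    """Map each pair of splits to the set of user_id values found in both."""
    speakers = defaultdict(set)
    for row in rows:
        speakers[row["split"]].add(row["user_id"])
    return {(a, b): speakers[a] & speakers[b] for a, b in combinations(splits, 2)}


def summarize(path):
    """Load a metadata file (assumed UTF-8) and return counts and speaker overlaps."""
    with open(path, encoding="utf-8", newline="") as handle:
        rows = read_metadata(handle)
    return count_by_split(rows), shared_speakers(rows)
```

Calling `summarize("TWB-Voice-1.0-hau/metadata.tsv")` on a local copy should, if the assumptions hold, return per-split counts that can be compared with the split table, and empty sets for every pair of splits.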
TWB-Voice-1.0-hau/
├── clips/
│   ├── train/     (26,996 WAV files)
│   ├── dev/       (2,540 WAV files)
│   ├── test/      (4,591 WAV files)
│   ├── rejected/  (1,265 WAV files)
│   └── pending/   (1,273 WAV files)
├── metadata.tsv
├── README.md
└── LICENSE
Audio files are in WAV format at a 48 kHz sample rate.
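A minimal sketch for confirming the sample rate of a downloaded clip with Python's standard `wave` module. It assumes the clips are uncompressed PCM WAV, which is what `wave` reads reliably; the function names are hypothetical, and no specific clip filename is implied.

```python
import wave


def wav_info(path):
    """Return (sample_rate_hz, channels, duration_seconds) for a PCM WAV file."""
    with wave.open(path, "rb") as wav:
        rate = wav.getframerate()
        frames = wav.getnframes()
        return rate, wav.getnchannels(), frames / rate


def is_48khz(path):
    """True if the file's header reports a 48,000 Hz sample rate."""
    return wav_info(path)[0] == 48_000
```

Many ASR pipelines expect a lower sample rate than 48 kHz, so checking the rate before feature extraction helps avoid silent mismatches; resampling itself is outside the scope of this sketch.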
This dataset is subject to Dataset Access Terms in addition to the CC BY-NC 4.0 license. By downloading, users become independent data controllers responsible for complying with applicable data protection laws.
@dataset{twb_voice_2025,
  title = {TWB Voice 1.0},
  author = {CLEAR Global},
  year = {2025},
  publisher = {Hugging Face},
  url = {https://huggingface.co/datasets/CLEAR-Global/twb-voice-1.0}
}
This dataset was created by CLEAR Global with support from the Patrick J. McGovern Foundation.
Please check README.md for more information.