Task: ASR
Release Date: 5/14/2026
Format: WAV, TSV
Size: 3.17 GB
TWB Voice 1.0 - Shuwa Arabic is the Shuwa Arabic language portion of the TWB Voice 1.0 multilingual speech corpus, created by CLEAR Global (formerly Translators without Borders). It contains approximately 15 hours of read speech recorded by native Shuwa Arabic speakers through the TWB Voice platform. The dataset includes 8,245 recordings across train, dev, test, rejected, and pending splits, with transcriptions and speaker demographic metadata (age, gender, education level, country of origin). Audio is in WAV format at 48 kHz. The dataset was created to support automatic speech recognition (ASR) development for underrepresented languages, with funding from the Patrick J. McGovern Foundation.
Licensing
Creative Commons Attribution Non Commercial 4.0 International (CC-BY-NC-4.0)
https://spdx.org/licenses/CC-BY-NC-4.0.html
Restrictions/Special Constraints
By downloading the data, the user becomes an independent data controller and is responsible for complying with applicable data protection laws (including GDPR), responding to data subject rights requests, and reporting any data breaches to CLEAR Global. Users must implement appropriate technical and organizational security measures. Data must be retained only as long as necessary for the intended purpose. Users must comply with any data deletion requests from CLEAR Global (including deletion from backups and derived datasets). See the Dataset Access Terms for full details.
Forbidden Usage
Attempting to determine the identity of speakers in the dataset. Re-hosting or re-sharing the dataset outside the terms of the license.
Ethical Review
All speakers consented to data collection and open publishing. Speaker identities are anonymized using user IDs. The dataset is gated for compliance with privacy and data protection laws (e.g., GDPR). By downloading, users accept data controller responsibilities under the Dataset Access Terms (https://huggingface.co/datasets/CLEAR-Global/TWB-Voice-dataset-access-terms), including handling data subject rights requests and data breach notifications.
Intended Use
Training and evaluation of automatic speech recognition (ASR) systems for Shuwa Arabic, speaker recognition research, language identification, and speech synthesis/TTS development.
Data was collected through the TWB Voice platform, coordinated by CLEAR Global. Native Shuwa Arabic speakers read prompted text, and the recordings were then reviewed by native speakers. This is the Shuwa Arabic subset of the full TWB Voice 1.0 dataset on Hugging Face.
| Split | Recordings | Hours |
|---|---|---|
| train | 5,702 | 10.03 |
| dev | 657 | 1.24 |
| test | 718 | 1.24 |
| rejected | 134 | 0.28 |
| pending | 1,034 | 1.97 |
| Total | 8,245 | 14.76 |
Splits ensure no speaker overlap between train/dev/test while maintaining 80/10/10 duration ratios. Rejected and pending recordings are preserved in separate splits.
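The two split properties stated above can be checked locally. The following is a minimal sketch, assuming each recording is available as a dict with `split` and `user_id` keys (the names come from the metadata field list); the function names are hypothetical and not part of the dataset's tooling.

```python
from itertools import combinations


def speakers_by_split(rows, splits=("train", "dev", "test")):
    """Collect the set of user_id values seen in each requested split."""
    speakers = {name: set() for name in splits}
    for row in rows:
        if row["split"] in speakers:
            speakers[row["split"]].add(row["user_id"])
    return speakers


def overlapping_speakers(rows, splits=("train", "dev", "test")):
    """Return {(split_a, split_b): shared user_ids} for every pair that overlaps."""
    speakers = speakers_by_split(rows, splits)
    overlaps = {}
    for a, b in combinations(splits, 2):
        shared = speakers[a] & speakers[b]
        if shared:
            overlaps[(a, b)] = shared
    return overlaps


def duration_shares(hours_by_split):
    """Convert per-split hours into fractions of their combined total."""
    total = sum(hours_by_split.values())
    return {name: hours / total for name, hours in hours_by_split.items()}


# Hours taken from the split table for train/dev/test.
shares = duration_shares({"train": 10.03, "dev": 1.24, "test": 1.24})
print({name: round(share, 3) for name, share in shares.items()})
```

With the table's figures, the shares come out to roughly 80.2% / 9.9% / 9.9%, consistent with the stated 80/10/10 ratio. An empty result from `overlapping_speakers` would indicate that no speaker appears in more than one of the three splits.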
| Gender | Hours |
|---|---|
| Male | 12.40 |
| Female | 0.12 |
| Total | 12.51 |
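The 12.51-hour total in the gender table equals the sum of the train, dev, and test hours (10.03 + 1.24 + 1.24), which suggests the breakdown covers only those three splits. A per-gender tally could be reproduced along the lines below. This is a sketch under two assumptions that the listing does not confirm: that the `duration` field is expressed in seconds, and that only train/dev/test recordings are counted. The `seconds_per_unit` parameter is a hypothetical knob for adjusting the unit (for example, `0.001` if durations turn out to be in milliseconds).

```python
from collections import defaultdict


def hours_by_gender(rows, splits=("train", "dev", "test"), seconds_per_unit=1.0):
    """Sum recording durations per gender and return the totals in hours.

    Assumes row["duration"] is numeric text in seconds unless
    seconds_per_unit says otherwise (an assumption, not a documented fact).
    """
    seconds = defaultdict(float)
    for row in rows:
        if row["split"] in splits:
            seconds[row["gender"]] += float(row["duration"]) * seconds_per_unit
    return {gender: total / 3600.0 for gender, total in seconds.items()}
```

If the recomputed totals differ noticeably from the table, the unit or split assumption is the first thing to revisit.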
id, sentence, task_id, sentence_source, user_id, age, gender, duration, locale, variant, education_level, country_of_origin, created_at, path, split
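As an illustration of working with these fields, the sketch below reads a tab-separated metadata file and groups its rows by split. It assumes the fields above are the header row of `metadata.tsv` and that the file uses default CSV-style quoting; neither is confirmed here, so consult README.md in the download. The function names are hypothetical.

```python
import csv
from collections import defaultdict


def load_metadata(tsv_path):
    """Read a tab-separated file with a header row into a list of dicts."""
    with open(tsv_path, newline="", encoding="utf-8") as handle:
        return list(csv.DictReader(handle, delimiter="\t"))


def group_by_split(rows):
    """Group metadata rows by the value of their `split` column."""
    groups = defaultdict(list)
    for row in rows:
        groups[row["split"]].append(row)
    return dict(groups)


if __name__ == "__main__":
    # Path assumed from the directory layout; adjust to the local download.
    rows = load_metadata("TWB-Voice-1.0-shu/metadata.tsv")
    for split, items in group_by_split(rows).items():
        print(split, len(items))
```

The per-split counts printed by the usage example can be compared against the Recordings column of the split table as a quick integrity check after download.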
TWB-Voice-1.0-shu/
├── clips/
│ ├── train/ (5,702 WAV files)
│ ├── dev/ (657 WAV files)
│ ├── test/ (718 WAV files)
│ ├── rejected/ (134 WAV files)
│ └── pending/ (1,034 WAV files)
├── metadata.tsv
├── README.md
├── LICENSE
└── DATASET_ACCESS_TERMS.md
Audio files are in WAV format with a 48 kHz sample rate.
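The stated sample rate can be confirmed per file with Python's standard `wave` module. This is a minimal sketch, assuming the clips are uncompressed PCM WAV files (the `wave` module raises `wave.Error` on formats it does not support); the function names are hypothetical.

```python
import wave


def wav_info(path):
    """Return basic header information for a PCM WAV file."""
    with wave.open(str(path), "rb") as reader:
        rate = reader.getframerate()
        frames = reader.getnframes()
        return {
            "sample_rate": rate,
            "channels": reader.getnchannels(),
            "sample_width_bytes": reader.getsampwidth(),
            "duration_seconds": frames / rate,
        }


def has_expected_rate(path, expected_rate=48000):
    """Check whether a WAV file's header reports the expected sample rate."""
    return wav_info(path)["sample_rate"] == expected_rate
```

Many ASR pipelines resample audio before training, so knowing the source rate up front helps when configuring that step.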
This dataset is subject to Dataset Access Terms in addition to the CC BY-NC 4.0 license. By downloading, users become independent data controllers responsible for complying with applicable data protection laws.
@dataset{twb_voice_2024,
title = {TWB Voice 1.0},
author = {CLEAR Global},
year = {2024},
publisher = {Hugging Face},
url = {https://huggingface.co/datasets/CLEAR-Global/TWB-Voice-1.0}
}
This dataset was created by CLEAR Global with support from the Patrick J. McGovern Foundation.
Please check README.md for more information.
Also published on Hugging Face