TWB Voice 1.0 - Shuwa Arabic | Mozilla Data Collective

Description

TWB Voice 1.0 - Shuwa Arabic is the Shuwa Arabic portion of the TWB Voice 1.0 multilingual speech corpus, created by CLEAR Global (formerly Translators without Borders). It contains approximately 15 hours of read speech recorded by native Shuwa Arabic speakers through the TWB Voice platform. The dataset includes 8,245 recordings across train, dev, test, rejected, and pending splits, each with a transcription and speaker demographic metadata (age, gender, education level, country of origin). Audio is in WAV format at a 48 kHz sample rate. The dataset was created to support automatic speech recognition (ASR) development for underrepresented languages, with funding from the Patrick J. McGovern Foundation.

Specifics

Licensing

Creative Commons Attribution Non Commercial 4.0 International (CC-BY-NC-4.0)

https://spdx.org/licenses/CC-BY-NC-4.0.html

Considerations

Data was collected through the TWB Voice platform, coordinated by CLEAR Global. Native Shuwa Arabic speakers read prompted text, and the recordings were then reviewed by native speakers. This is the Shuwa Arabic subset of the full TWB Voice 1.0 dataset on Hugging Face.

Data Splits

Split	Recordings	Hours
train	5,702	10.03
dev	657	1.24
test	718	1.24
rejected	134	0.28
pending	1,034	1.97
Total	8,245	14.76

The train, dev, and test splits share no speakers and follow an approximate 80/10/10 ratio by duration. Rejected and pending recordings are preserved in separate splits.
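The duration ratio can be checked directly from the hours in the table above. This is a minimal sketch: the helper name `duration_shares` is hypothetical, and the only inputs are the train, dev, and test figures copied from the table.

```python
# Hours per approved split, copied from the Data Splits table.
approved_hours = {"train": 10.03, "dev": 1.24, "test": 1.24}


def duration_shares(hours):
    """Return each split's fraction of the total duration."""
    total = sum(hours.values())
    return {name: value / total for name, value in hours.items()}


shares = duration_shares(approved_hours)
for name, share in shares.items():
    # Prints roughly 80.2% for train and 9.9% for dev and test.
    print(f"{name}: {share:.1%}")
```

The shares come out close to, but not exactly, 80/10/10, which is expected when whole speakers are assigned to a single split.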

Gender Distribution (approved splits)

Gender	Hours
Male	12.40
Female	0.12
Total	12.51

The per-gender figures sum to 12.52 rather than 12.51, presumably because each value is rounded separately.

Data Fields

id, sentence, task_id, sentence_source, user_id, age, gender, duration, locale, variant, education_level, country_of_origin, created_at, path, split
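
As an illustration of how these fields might be used, the sketch below groups rows by `split` and checks that no `user_id` appears in more than one of train, dev, and test. It assumes that `metadata.tsv` is tab-separated with a header row using the field names above and that `user_id` identifies a speaker; neither is confirmed here. The helper names and the sample rows are hypothetical placeholders, not records from the dataset.

```python
import csv
import io
from collections import defaultdict


def speakers_by_split(rows):
    """Map each split name to the set of user_id values seen in it."""
    speakers = defaultdict(set)
    for row in rows:
        speakers[row["split"]].add(row["user_id"])
    return speakers


def overlapping_speakers(speakers, splits=("train", "dev", "test")):
    """Return user_ids that appear in more than one of the given splits."""
    first_seen = {}
    overlaps = set()
    for split in splits:
        for user in speakers.get(split, set()):
            if user in first_seen and first_seen[user] != split:
                overlaps.add(user)
            first_seen.setdefault(user, split)
    return overlaps


# Synthetic placeholder rows (NOT real dataset records), with only the
# columns this check needs. The real file is assumed to have more columns.
sample_tsv = (
    "id\tuser_id\tduration\tsplit\n"
    "1\tspk_a\t6.2\ttrain\n"
    "2\tspk_a\t5.8\ttrain\n"
    "3\tspk_b\t7.1\tdev\n"
    "4\tspk_c\t4.9\ttest\n"
)
rows = list(csv.DictReader(io.StringIO(sample_tsv), delimiter="\t"))
speakers = speakers_by_split(rows)
print(overlapping_speakers(speakers))
```

To run the same check on the real file, replace the `io.StringIO(...)` source with an open handle on `metadata.tsv` (opened with `newline=""` and `encoding="utf-8"`). If sentences contain quote characters, passing `quoting=csv.QUOTE_NONE` to `csv.DictReader` may be needed.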

File Structure

TWB-Voice-1.0-shu/
├── clips/
│   ├── train/       (5,702 WAV files)
│   ├── dev/         (657 WAV files)
│   ├── test/        (718 WAV files)
│   ├── rejected/    (134 WAV files)
│   └── pending/     (1,034 WAV files)
├── metadata.tsv
├── README.md
├── LICENSE
└── DATASET_ACCESS_TERMS.md

Audio files are in WAV format with a 48 kHz sample rate.
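
A quick way to confirm the sample rate of a clip is Python's standard `wave` module. The sketch below writes a one-second stand-in file and reads it back, so that it runs without the dataset. The helper names, the mono channel layout, and the 16-bit sample width are assumptions made for the demo; the dataset description states only the WAV container and the 48 kHz rate.

```python
import os
import tempfile
import wave


def wav_info(path):
    """Return (sample_rate_hz, channels, duration_seconds) for a PCM WAV file."""
    with wave.open(path, "rb") as wav:
        rate = wav.getframerate()
        return rate, wav.getnchannels(), wav.getnframes() / rate


def write_silence(path, rate=48000, seconds=1):
    """Write 16-bit mono silence as a stand-in for a real clip (demo only)."""
    with wave.open(path, "wb") as wav:
        wav.setnchannels(1)   # assumption: mono
        wav.setsampwidth(2)   # assumption: 16-bit samples
        wav.setframerate(rate)
        wav.writeframes(b"\x00\x00" * rate * seconds)


with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "demo.wav")
    write_silence(demo_path)
    demo_info = wav_info(demo_path)

print(demo_info)  # (48000, 1, 1.0)
```

For real clips, `wav_info` would be pointed at a file under `clips/train/` or another split folder. The `wave` module reads only uncompressed PCM, so a different reader would be needed if the clips use another encoding.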

Dataset Access Terms

This dataset is subject to Dataset Access Terms in addition to the CC BY-NC 4.0 license. By downloading, users become independent data controllers responsible for complying with applicable data protection laws.

Citation

@dataset{twb_voice_2024,
  title = {TWB Voice 1.0},
  author = {CLEAR Global},
  year = {2024},
  publisher = {Hugging Face},
  url = {https://huggingface.co/datasets/CLEAR-Global/TWB-Voice-1.0}
}

Disclaimer

This dataset was created by CLEAR Global with support from the Patrick J. McGovern Foundation.

Please check README.md for more information.

Also published on Hugging Face

TWB Voice 1.0 - Shuwa Arabic