License:
CC-BY-NC-4.0
Steward:
CLEAR Global
Task: ASR
Release Date: 4/24/2026
Format: WAV, TSV
Size: 1.59 GB
A single-speaker read speech dataset in Kenyan Swahili, produced as part of CLEAR Global's Gamayun Language Data Kits initiative. The dataset contains 4,700 pre-segmented utterances (~6 hours, 21,852 seconds) recorded by an anonymous male Kenyan speaker. Sentences were prompted from a script: Swahili translations of general-domain English sentences sourced from the Tatoeba repository. The same sentence set is used in CLEAR Global's Gamayun Swahili–English parallel text kit. The archive includes WAV audio files organised by recording session and a metadata TSV with transcriptions, file paths, and durations.
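As a quick sanity check after downloading, the utterance count and total duration can be recomputed from the metadata TSV. The sketch below is a minimal example under stated assumptions: the column name `duration`, the header row, and the sample rows are all hypothetical, since the actual TSV layout is not described here. Check README.md for the real column names before using it.

```python
import csv
import io


def summarize_metadata(tsv_text, duration_column="duration"):
    """Return (utterance_count, total_seconds) for a metadata TSV given as text.

    `duration_column` is an assumed column name; adjust it to match the
    header of the actual metadata file.
    """
    # QUOTE_NONE keeps quotation marks inside transcriptions as literal text.
    reader = csv.DictReader(
        io.StringIO(tsv_text), delimiter="\t", quoting=csv.QUOTE_NONE
    )
    count = 0
    total_seconds = 0.0
    for row in reader:
        total_seconds += float(row[duration_column])
        count += 1
    return count, total_seconds


# Invented two-row sample (not taken from the dataset) to show the call.
sample = (
    "path\ttranscription\tduration\n"
    "session_01/utt_0001.wav\tHabari yako?\t2.5\n"
    "session_01/utt_0002.wav\tAsante sana.\t1.5\n"
)

count, total = summarize_metadata(sample)
print(f"{count} utterances, {total:.1f} s ({total / 3600:.4f} h)")
```

Run against the full metadata file (read with `open(path, encoding="utf-8").read()`), the result should be close to the 4,700 utterances and 21,852 seconds stated above, which works out to roughly 6.07 hours, or about 4.6 seconds per utterance on average.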
Licensing
Creative Commons Attribution Non Commercial 4.0 International (CC-BY-NC-4.0)
https://spdx.org/licenses/CC-BY-NC-4.0.html
Restrictions/Special Constraints
Non-commercial use only. Attribution to CLEAR Global is required.
Forbidden Usage
Commercial use of any kind. Attempting to determine the identity of the speaker. Voice cloning or text-to-speech systems that match the characteristics of the speaker. Re-hosting or re-sharing this dataset without CLEAR Global's explicit permission. Use without attribution to CLEAR Global.
Ethical Review
Recordings were made with the full knowledge and consent of the speaker. The speaker's identity is kept anonymous. No personally identifying information is included in the dataset.
Intended Use
Development of automatic speech recognition (ASR) systems for Kenyan Swahili; speech research for under-resourced language varieties; related NLP tasks such as language modelling and pronunciation lexicon development.
This dataset is part of CLEAR Global's Gamayun Language Data Kits initiative, which develops open-source language resources for under-resourced languages used in humanitarian contexts. Source English sentences were selected from Tatoeba using a frequency-based algorithm documented in the corepus-gen repository, and translated by the CLEAR Global translator community.
The audio reflects natural prompted read speech and is not optimised for TTS synthesis. The speaker is an anonymous male Kenyan Swahili speaker.
If you use this data, please cite Gamayun – Language Technology for Humanitarian Response:
@inproceedings{oktem2020gamayun,
title = {Gamayun -- Language Technology for Humanitarian Response},
author = {Öktem, Alp and Albayk Jaam, Muhannad and DeLuca, Eric and Tang, Grace},
booktitle = {2020 IEEE Global Humanitarian Technology Conference (GHTC)},
year = {2020},
address = {Virtual},
month = {October 29 -- November 1}
}
Please check README.md for more information.