Read Speech in Kenyan Swahili (6h)

Description

A single-speaker read speech dataset in Kenyan Swahili, produced as part of CLEAR Global's Gamayun Language Data Kits initiative. The dataset contains 4,700 pre-segmented utterances (~6 hours, 21,852 seconds) recorded by an anonymous male Kenyan speaker. Sentences were prompted from a script: Swahili translations of general-domain English sentences sourced from the Tatoeba repository. The same sentence set is used in CLEAR Global's Gamayun Swahili–English parallel text kit. The archive includes WAV audio files organised by recording session and a metadata TSV with transcriptions, file paths, and durations.
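The metadata TSV described above can be used to check the utterance count and total duration. The following is a minimal sketch, not an official loader: the column names `path`, `transcription`, and `duration` are assumptions and should be checked against the actual header of the TSV in the archive. The sample rows are made up for illustration.

```python
import csv
import io

# Hypothetical column name; verify against the real TSV header.
DURATION_COLUMN = "duration"


def summarize_durations(tsv_file, duration_column=DURATION_COLUMN):
    """Return (utterance_count, total_seconds) for an open metadata TSV."""
    # QUOTE_NONE keeps quotation marks inside transcriptions as literal text.
    reader = csv.DictReader(tsv_file, delimiter="\t", quoting=csv.QUOTE_NONE)
    count = 0
    total_seconds = 0.0
    for row in reader:
        total_seconds += float(row[duration_column])
        count += 1
    return count, total_seconds


# Made-up sample standing in for the real metadata file.
sample = io.StringIO(
    "path\ttranscription\tduration\n"
    "session1/utt_0001.wav\tHabari ya asubuhi.\t3.5\n"
    "session1/utt_0002.wav\tAsante sana.\t2.0\n"
)
count, total = summarize_durations(sample)
print(f"{count} utterances, {total:.1f} s, {total / 3600:.4f} h")
```

Applied to the published figures, 21,852 seconds divided by 3,600 gives about 6.07 hours, and 21,852 seconds over 4,700 utterances gives an average of about 4.65 seconds per utterance.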

Specifics

Licensing

Creative Commons Attribution Non Commercial 4.0 International (CC-BY-NC-4.0)

https://spdx.org/licenses/CC-BY-NC-4.0.html

Considerations

Restrictions/Special Constraints

Non-commercial use only. Attribution to CLEAR Global is required.

Forbidden Usage

Commercial use of any kind. Attempting to determine the identity of the speaker. Voice cloning or text-to-speech synthesis that matches the characteristics of the speaker. Re-hosting or re-sharing this dataset without CLEAR Global's explicit permission. Usage without attribution to CLEAR Global.

This dataset is part of CLEAR Global's Gamayun Language Data Kits initiative, which develops open-source language resources for under-resourced languages used in humanitarian contexts. Source English sentences were selected from Tatoeba using a frequency-based algorithm documented in the corepus-gen repository, and translated by the CLEAR Global translator community.

The audio reflects natural prompted read speech and is not optimised for TTS synthesis. The speaker is an anonymous male speaker of Kenyan Swahili.

Citation

If you use this data, please cite Gamayun – Language Technology for Humanitarian Response:

@inproceedings{oktem2020gamayun,
  title     = {Gamayun -- Language Technology for Humanitarian Response},
  author    = {Öktem, Alp and Albayk Jaam, Muhannad and DeLuca, Eric and Tang, Grace},
  booktitle = {2020 IEEE Global Humanitarian Technology Conference (GHTC)},
  year      = {2020},
  address   = {Virtual},
  month     = {October 29 -- November 1}
}

Please check README.md for more information.
