License:
NOODL-1.0
Steward:
Institute of African Digital Humanities
Task: TTS
Release Date: 4/18/2026
Format: MP3, TSV
Size: 276.02 MB
This dataset comprises audio recordings of isiXhosa speech aligned with textual transcriptions. It is organized into 24 folders, each containing MP3 audio files and a tab-separated mapping file that pairs every audio file with its transcription. The clips are short, typically 1 to 13 seconds long, and are suitable for training and evaluating Text-to-Speech (TTS) systems.
The textual content originates from written isiXhosa sources published on the indigenous-language blogging platform IndigenousBlogs (https://indigenousblogs.com/xh/), which hosts original content authored by isiXhosa-speaking bloggers on a range of topics, including narrative texts, opinion pieces, cultural commentary, and everyday informational content. These texts were segmented into short utterances suitable for read speech and TTS modelling.
Licensing
Nwulite Obodo Open Data Licence 1.0 (NOODL-1.0)
https://licensingafricandatasets.com/nwulite-obodo-license
Restrictions/Special Constraints
- For research and scientific use only
- You agree not to re-host or redistribute this dataset
Forbidden Usage
You agree not to use the data for:
- Generative AI
- Voice cloning or speaker imitation
- Reproduction, duplication, modification, or redistribution
- Commercial use without explicit permission
Intended Use
This dataset is intended for the training and evaluation of Text-to-Speech (TTS) systems for the isiXhosa language. It aims to support:
- Language technology development for one of South Africa's major indigenous languages
- Development of speech technologies for under-resourced African languages
- Educational applications in multilingual contexts
- Research in low-resource and African language speech synthesis
isiXhosa (also known as Xhosa) is a Bantu language belonging to the Nguni subgroup of the Southern Bantu languages. It is one of the most widely spoken indigenous languages in South Africa and is recognized as one of the twelve official languages of the Republic of South Africa (eleven spoken languages plus South African Sign Language, added in 2023). It is also spoken in parts of Lesotho, Zimbabwe, and other southern African communities.
isiXhosa is spoken by approximately 8 to 10 million first-language speakers, with several million additional second-language speakers, making it one of the major indigenous languages of southern Africa. It is the second most commonly spoken home language in South Africa after isiZulu.
The language is most strongly concentrated in the Eastern Cape province, which is considered its historical heartland, as well as in the Western Cape (particularly in and around Cape Town), Gauteng, and the Free State. isiXhosa has significant cultural visibility through literature, music, broadcasting, and film, and is the home language of many prominent South African public figures. It is used in education, government, media, and everyday communication, and it plays a central role in the cultural and linguistic identity of the amaXhosa people.
isiXhosa exhibits regional and sociolectal variation, though all varieties are broadly mutually intelligible. Linguists and speakers recognize several dialectal groupings within the Xhosa cluster:
Eastern Cape varieties include:
Gcaleka — spoken in the former Transkei region; considered by many speakers to be a prestige or "deep" variety of isiXhosa
Ngqika (Rharhabe) — spoken in the central and western Eastern Cape
Thembu — spoken in the Thembu traditional territory around Mthatha and surrounding areas
Bomvana, Mpondo, Mpondomise, Xesibe, and Bhaca — additional sub-varieties spoken across different traditional communities of the Eastern Cape, each showing distinct lexical and phonological features
Urban varieties:
Urban isiXhosa (particularly as spoken in Cape Town, Gqeberha / Port Elizabeth, East London, and Johannesburg) reflects contact with English, Afrikaans, and other South African languages, and often incorporates code-switching and loanwords.
The variety represented in this dataset reflects the standard written register of isiXhosa as commonly used in contemporary digital publications, education, and media, intelligible across regional varieties.
isiXhosa is written using a standardized Latin-based orthography developed initially by nineteenth-century missionaries and subsequently refined by South African language authorities. The orthography is broadly phonemic and is officially codified; it is taught in schools and used in government, publishing, broadcasting, and online media.
Key orthographic features of written isiXhosa:
isiXhosa is famously characterized by its use of three click consonants, represented in the orthography by the letters c (dental click), q (alveolar click), and x (lateral click). These may be combined with other letters to indicate voicing, aspiration, or nasalization (e.g., "gc", "nq", "nx", "xh").
Several other consonants are written with digraphs or trigraphs, for example the lateral fricatives "hl" and "dl", the aspirated affricate "tsh", and the palatal stop "ty".
Words are typically written conjunctively, meaning that grammatical prefixes, stems, and some suffixes are joined into a single orthographic word, resulting in long word forms typical of Bantu conjunctive orthographies.
Vowel length and tone are generally not marked in the standard orthography, though they are phonologically significant.
The transcriptions in this dataset follow the standard modern isiXhosa orthography used in contemporary written publications.
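Because the click letters are plain Latin characters, a rough count of click types in a transcription can be obtained by simple character matching. The sketch below assumes that every `c`, `q`, or `x` in the text denotes a click, which holds for native vocabulary in the standard orthography but may miscount loanwords, foreign names, and abbreviations. The function name `count_clicks` is illustrative and not part of any dataset tooling.

```python
from collections import Counter

# Click letters and their types, as described in the orthography notes above.
CLICK_LETTERS = {"c": "dental", "q": "alveolar", "x": "lateral"}


def count_clicks(text: str) -> Counter:
    """Count click letters by type, ignoring case.

    Assumption: every c, q, or x is a click. Digraphs such as "gc" or
    "xh" are counted once, via their click letter.
    """
    counts = Counter()
    for ch in text.lower():
        if ch in CLICK_LETTERS:
            counts[CLICK_LETTERS[ch]] += 1
    return counts


# Sentence taken from the sample mapping entries in this card.
print(count_clicks("utshilo umasipala kwingxelo emfutshane."))
```

A count like this could help check how well each click type is covered by a training subset before it is used for TTS modelling.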
isiXhosa is an agglutinative Bantu language with a rich morphological system. Key grammatical features include:
Noun class system:
isiXhosa organizes nouns into a system of approximately 15 noun classes (traditionally numbered following Bantu convention), each marked by a characteristic prefix (e.g., "um-/aba-" for human singular/plural, "isi-/izi-" for class 7/8, "ili-/ama-" for class 5/6).
Agreement markers derived from noun-class prefixes attach to verbs, adjectives, possessives, demonstratives, and other dependents, producing extensive concordial agreement throughout the clause.
Verbal morphology:
The verb is highly agglutinative and can host subject markers, object markers, tense/aspect/mood markers, negation, extensions (applicative, causative, passive, reciprocal, etc.), and a final vowel, all within a single word.
Tense and aspect distinctions include present, past (recent and remote), future, perfective, progressive, habitual, and others, expressed through combinations of prefixes, suffixes, and auxiliaries.
Phonology:
isiXhosa is notable for its click consonants, which fall into three basic types (dental, alveolar, and lateral) and entered the language through historical contact with neighboring Khoisan languages.
It is a tonal language with two underlying tones (high and low), though tone is not marked in the orthography.
Word order:
Basic word order is Subject–Verb–Object (SVO), though rich agreement and noun-class marking permit considerable flexibility for information-structural purposes.
The textual material in this dataset originates from isiXhosa-language blog posts published on IndigenousBlogs (https://indigenousblogs.com/xh/), a platform dedicated to promoting writing and digital content in indigenous African languages. The posts cover a range of topics including narrative and reflective writing, cultural and social commentary, informational content, and everyday discourse. The texts were segmented into short utterances suitable for read speech and used as prompts for audio recording sessions.
This dataset is derived from prompted read speech. The speaker read aloud pre-written isiXhosa texts drawn from blog-style narrative, reflective, and informational sources. The content covers a range of registers and everyday topics typical of written isiXhosa in contemporary digital media, including personal narratives, cultural commentary, and general discourse.
The dataset has been structured as segmented, read-style speech suitable for speech synthesis tasks.
The dataset is composed of 24 folders containing audio clips and corresponding mapping files.
Each folder contains between 10 and 210 audio files. Individual audio clips typically range from 1 to 13 seconds in duration.
Folder-level durations range from approximately 1 minute 13 seconds to over 31 minutes of audio.
The dataset represents a total of 2,135 audio files with a combined duration of approximately 4 hours 56 minutes and 26 seconds of segmented isiXhosa speech.
A detailed breakdown of durations and file counts per folder is provided below.
| Folder | Files | Duration |
|---|---|---|
| isiXhosa_tts_dataset_26clips_256s_20260401-0003 | 31 | 2m 42s |
| isiXhosa_tts_dataset2_22clips_141s_20260401-0049 | 22 | 1m 50s |
| isiXhosa_tts_dataset3_20clips_126s_20260401-0059 | 20 | 1m 36s |
| isiXhosa_tts_dataset4_51clips_383s_20260401-0123 | 51 | 5m 16s |
| isiXhosa_tts_dataset6_10clips_79s_20260401-0137 | 10 | 1m 13s |
| isiXhosa_tts_dataset7_24clips_134s_20260401-0154 | 24 | 1m 46s |
| isiXhosa_tts_dataset8_20clips_151s_20260401-0208 | 20 | 2m 09s |
| isiXhosa_tts_dataset9_24clips_263s_20260401-0220 | 24 | 3m 09s |
| isiXhosa_tts_dataset13_210clips_2429s_20260401-2335 | 210 | 31m 55s |
| isiXhosa_tts_dataset14_114clips_1281s_20260402-0046 | 114 | 16m 49s |
| isiXhosa_tts_dataset15_206clips_2348s_20260402-0212 | 206 | 30m 10s |
| isiXhosa_tts_dataset16_202clips_2534s_20260402-1118 | 202 | 31m 15s |
| isiXhosa_tts_dataset17_50clips_621s_20260402-1244 | 50 | 7m 49s |
| isiXhosa18_tts_dataset_80clips_617s_20260414-1902 | 80 | 8m 15s |
| isiXhosa19_tts_dataset_133clips_774s_20260415-0014 | 133 | 11m 20s |
| isiXhosa20_tts_dataset_89clips_788s_20260415-0058 | 89 | 11m 27s |
| isiXhosa21_tts_dataset_167clips_1433s_20260415-0246 | 167 | 20m 16s |
| isiXhosa22_tts_dataset_137clips_1185s_20260415-2019 | 137 | 17m 24s |
| isiXhosa24_tts_dataset_178clips_1512s_20260415-2058 | 178 | 22m 30s |
| isiXhosa25_tts_dataset_105clips_916s_20260415-2121 | 105 | 12m 50s |
| isiXhosa26_tts_dataset_68clips_899s_20260416-2132 | 68 | 13m 35s |
| isiXhosa27_tts_dataset_80clips_1170s_20260416-2159 | 80 | 15m 26s |
| isiXhosa28_tts_dataset_51clips_917s_20260416-2256 | 51 | 13m 11s |
| isiXhosa29_tts_dataset_63clips_1025s_20260416-2344 | 63 | 12m 24s |
| GRAND TOTAL | 2,135 | 4h 56m 26s |
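
The per-folder durations in the table can be totaled programmatically. The sketch below converts duration strings written in the table's style (for example `2m 42s`) to seconds and formats a sum in the same style. The helper names `to_seconds` and `format_duration` are illustrative, and only the first three table rows are used as input.

```python
import re

UNIT_SECONDS = {"h": 3600, "m": 60, "s": 1}


def to_seconds(duration: str) -> int:
    """Convert strings such as '2m 42s' or '4h 56m 26s' to seconds."""
    parts = re.findall(r"(\d+)\s*([hms])", duration)
    if not parts:
        raise ValueError(f"unrecognized duration: {duration!r}")
    return sum(int(value) * UNIT_SECONDS[unit] for value, unit in parts)


def format_duration(total_seconds: int) -> str:
    """Format seconds in the table's style, with hours only when needed."""
    hours, rest = divmod(total_seconds, 3600)
    minutes, seconds = divmod(rest, 60)
    if hours:
        return f"{hours}h {minutes:02d}m {seconds:02d}s"
    return f"{minutes}m {seconds:02d}s"


# Durations of the first three folders listed in the table.
rows = ["2m 42s", "1m 50s", "1m 36s"]
total = sum(to_seconds(r) for r in rows)
print(total, format_duration(total))  # 368 6m 08s
```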
Each folder in the dataset contains:
A collection of audio files in MP3 format
A tab-separated mapping file linking each audio file to its transcription
Each line in the mapping file contains four tab-separated fields, in this order:
audio_filename.mp3 <TAB> key <TAB> sentence <TAB> attempts
The dataset is designed for TTS pipelines requiring paired audio-text data.
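As a rough illustration of how the mapping format above might be consumed, the following Python sketch parses tab-separated lines into records. The field names (`audio_filename`, `key`, `sentence`, `attempts`) are taken from the format line. The `MappingEntry` class, the `parse_mapping` function, the handling of a possible header row, and the sample `key` and `attempts` values are assumptions made for illustration, since this card does not document them.

```python
import csv
import io
from dataclasses import dataclass


@dataclass
class MappingEntry:
    audio_filename: str
    key: str
    sentence: str
    attempts: str  # kept as text; the card does not specify its type


def parse_mapping(text: str) -> list[MappingEntry]:
    """Parse tab-separated mapping text into entries.

    Assumes four tab-separated fields per line, in the order given in
    the format line. Rows that do not name an .mp3 file (for example a
    header row, if one exists) or that have fewer than four fields are
    skipped.
    """
    # QUOTE_NONE keeps quotation marks inside sentences untouched.
    reader = csv.reader(io.StringIO(text), delimiter="\t", quoting=csv.QUOTE_NONE)
    entries = []
    for row in reader:
        if len(row) < 4 or not row[0].lower().endswith(".mp3"):
            continue
        entries.append(MappingEntry(row[0], row[1], row[2], row[3]))
    return entries


# Hypothetical sample: the filename and sentence appear in this card,
# while the header row, key, and attempts values are invented placeholders.
sample = (
    "audio_filename\tkey\tsentence\tattempts\n"
    "09f59fc3bbadea2519f37169b811343c.mp3\tk1\t"
    "utshilo umasipala kwingxelo emfutshane.\t1\n"
)
entries = parse_mapping(sample)
print(entries[0].audio_filename, "->", entries[0].sentence)
```

Disabling CSV quoting matters here because some transcriptions begin with a quotation mark, which a default CSV reader would treat as the start of a quoted field.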
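Below are representative entries drawn from one of the mapping files in the dataset, illustrating the pairing between audio filenames and their isiXhosa transcriptions. Only the audio filename and the transcription are shown, separated by a vertical bar for readability: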
1d0f81e4b818f2d1b13ef46970c3e7f8.mp3 | Amadoda amabini aseSwatini abanjelwe intsangu eMbhashe
783489c85c0221e44b0fa0c13f45202f.mp3 | Ukubanjwa kwabo kwenzeke ngoLwesithathu ngamagosa omthetho kuMasipala waseMbhashe.
82ea41f397024a0deb5b27c3321111ea.mp3 | Amadoda amabini aseSwatini aneminyaka ephakathi kwamashumi amabini anesibini (22) namashumi amathathu anesibini (32) abanjwe emva kokufunyanwa enentsangu nemali kwilokishi yaseGovan Mbeki, eMbhashe.
09f59fc3bbadea2519f37169b811343c.mp3 | utshilo umasipala kwingxelo emfutshane.
dcf6d9a6e31c305ac0a545f2b9c9172e.mp3 | IBotswana ibuyisele isithintelo sokungeniswa kwemifuno yangaphandle
44072c00390ff08571ed33d4f5c1526e.mp3 | Malema: 'Nokuba ndingavalelwa ejele, yona imibono soze icime'
0f0cac7581d449fde548a863e9c6cfcc.mp3 | Uthi ikamva leEFF ligqamile ekhona engekho akukho nto inokuze itshabalalise lo mbutho.
7c6c37f683b46e5794575c7e63f58921.mp3 | UMalema ugwetywe iminyaka emihlanu entolongweni
970e7ac45be16a233f915ed1baf694b0.mp3 | Ityala liyaqhuba ukusukela ngentsimbi yethoba kusasa ngoLwesine.
543222bfe9a3768f5493274645f9c922.mp3 | "Iphupha lam kukubona iimpahla zabo zithengiswa ziivenkile ezinkulu zempahla," utshilo.