License:
NOODL-1.0
Steward:
Institute of African Digital Humanities
Dataset ID:
cmq82rl5i012ymk07yii2lzya
Task: TTS
Release Date: 6/10/2026
Format: MP3, TSV
Size: 179.11 MB
This dataset comprises 1,127 high-quality audio recordings of read speech produced by a single Afaan Oromo speaker over 19 sessions. Afaan Oromo (ISO 639-3: orm), also known as Oromo or Oromiffa, is a Cushitic language of the Afroasiatic family and the most widely spoken language in Ethiopia. It is spoken primarily in the Oromia region of Ethiopia, with significant speaker communities in neighbouring regions and in Kenya, Somalia, and the diaspora. Although it is one of the most widely spoken languages in Africa, with an estimated 40 to 50 million speakers, it remains severely under-resourced in digital speech data, which makes this dataset a significant contribution to natural language processing efforts for the language.
Audio files are provided in MP3 format (approx. 184 MB), totalling 3 hours, 20 minutes and 47 seconds of speech. The dataset includes 19 audio/sentence mapping files in TSV format, containing 1,127 aligned audio/sentence pairs in total. Transcriptions follow the Qubee orthographic system, the standardised Latin-based alphabet officially adopted for Afaan Oromo in 1991.
The recordings draw on news reports and biographical narratives in Afaan Oromo. These texts reflect contemporary journalistic and narrative registers of the language and offer prosodic and lexical diversity for training and evaluating TTS and ASR models. The dataset is intended for research and scientific use in speech technology for Afaan Oromo.
Licensing
Nwulite Obodo Open Data Licence 1.0 (NOODL-1.0)
https://licensingafricandatasets.com/nwulite-obodo-license
Restrictions/Special Constraints
- For research and scientific use only
- You agree that you will not re-host or re-share this dataset
Forbidden Usage
You agree not to use the data for: determining the identity of the speakers in the dataset; attempting to clone the voice of, or training models that imitate, the speakers in this dataset; Generative AI; reproduction; duplication; modification; augmentation; copying; distribution; transmission; display; sale; transfer; publication or creation of derivative works without the explicit permission of the legal owner of the dataset.
Intended Use
The dataset is suitable for speech-related tasks, in particular Text-to-Speech (TTS) and Automatic Speech Recognition (ASR) for Afaan Oromo. The audio-text alignment in this dataset enables speech synthesis and speech recognition models to be trained or evaluated for the development of more inclusive and representative TTS and ASR tools for Afaan Oromo, a low-resource African language with limited existing digital speech resources despite its status as one of the most widely spoken languages on the continent.
Afaan Oromo (ISO 639-3: orm) is a Cushitic language of the Afroasiatic family and the most widely spoken language in Ethiopia, with an estimated 40 to 50 million speakers worldwide. It is the official working language of the Oromia Regional State and one of the five federal working languages of Ethiopia. Afaan Oromo belongs to the Eastern Cushitic branch and is typologically characterised by agglutinative morphology, verb-final word order (SOV), and a system of grammatical gender (masculine and feminine). The language exhibits phonemic vowel length distinctions, a glottal stop phoneme, and a pitch-accent system.
The recordings in this dataset were produced by a single speaker of Afaan Oromo from Ethiopia. The variety recorded is representative of the standard written and broadcast norm of Afaan Oromo as used in Oromia, which underpins the Qubee orthographic standard. No sub-variety distinctions are encoded in the dataset.
The orthography used in the transcription of the audio recordings follows the Qubee alphabet, the standardised Latin-based writing system officially adopted for Afaan Oromo in 1991 by the Oromia regional government. Qubee replaced the Ethiopic (Ge'ez) script, which had been used for Oromo in some contexts, and is now the universally used writing system for the language in Ethiopia and across the Oromo diaspora.
Qubee is built on the Latin alphabet and augmented with a small number of conventions to represent sounds specific to Afaan Oromo. Vowel length, which is phonemically contrastive in the language, is indicated by doubling the vowel letter (e.g., a vs. aa, e vs. ee, i vs. ii, o vs. oo, u vs. uu). The glottal stop, a phonemically distinctive consonant in Afaan Oromo, is represented by an apostrophe or right single quotation mark (ʼ or '). The digraph dh represents a voiced alveolar implosive stop (often realised as retroflex), and ph represents a bilabial ejective stop. The result is an orthographic system that is accessible to Oromo literacy learners while remaining phonologically faithful to the spoken language, so the transcriptions in this dataset closely represent the Afaan Oromo heard in the recordings.
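The conventions described above (doubled vowel letters for length, an apostrophe for the glottal stop, and digraphs such as dh and ph) can be illustrated with a small grapheme splitter. This is a minimal sketch, not part of the dataset tooling: the function names are hypothetical, and the digraphs ch, ny and sh are an assumption added here, since the description names only dh and ph.

```python
import re

# Apostrophe-like characters mapped to a plain ASCII apostrophe.
# Assumption: all of these may appear for the glottal stop in the transcripts.
_APOSTROPHE_VARIANTS = str.maketrans({"\u2019": "'", "\u02bc": "'", "\u2018": "'"})

# dh and ph are named in the dataset description; ch, ny and sh are assumed here.
DIGRAPHS = ("ch", "dh", "ny", "ph", "sh")

# Order matters: digraphs first, then doubled (long) vowels, then single letters.
_GRAPHEME = re.compile("|".join(DIGRAPHS) + r"|([aeiou])\1|[a-z]|'")


def normalize_glottal(text: str) -> str:
    """Replace apostrophe variants with a single ASCII apostrophe."""
    return text.translate(_APOSTROPHE_VARIANTS)


def graphemes(word: str) -> list[str]:
    """Split one word into Qubee graphemes; other characters are skipped."""
    word = normalize_glottal(word).lower()
    return [m.group(0) for m in _GRAPHEME.finditer(word)]


def count_long_vowels(word: str) -> int:
    """Count doubled-vowel graphemes (aa, ee, ii, oo, uu) in a word."""
    return sum(1 for g in graphemes(word) if len(g) == 2 and g[0] == g[1])


if __name__ == "__main__":
    for w in ("taa'ee", "dhabuu", "Israa\u2019el"):
        print(w, graphemes(w), count_long_vowels(w))
```

The splitter works on single words. It does not tell a glottal-stop apostrophe from an apostrophe used as a quotation mark, which also occurs in the sample sentences, so sentence-level text would need that distinction handled first.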
This dataset was compiled as part of a Text-to-Speech data collection initiative for Afaan Oromo. The textual source material consists of news reports and biographical narratives in Afaan Oromo, drawn from publicly available Afaan Oromo media. The speaker read prepared passages drawn from these texts, providing naturalistic and culturally grounded speech data. Recordings were made across 19 sessions and subsequently curated, deduplicated, and aligned with their corresponding transcriptions.
The dataset consists of prompted read speech in Afaan Oromo. The textual source material derives from contemporary news reports and biographical narratives — a register that is representative of standard written and broadcast Afaan Oromo. The recordings offer good prosodic and lexical diversity — including varied sentence lengths, clause structures, and registers — making them suitable for TTS model training and ASR evaluation.
Total size of MP3 audio: approx. 184 MB
Total size of TSV mapping files: approx. 268 KB
This dataset comprises audio clips and audio/text mapping files organised across 19 recording sessions. There are 1,127 audio clips in MP3 format, totalling 3 hours, 20 minutes and 47 seconds of speech. The dataset includes 19 audio/text mapping files (mapping.tsv), each containing aligned audio/sentence pairs for the corresponding session, with 1,127 aligned pairs in total. Each TSV file contains the following fields: audio_filename, key, sentence, attempts.
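The figures quoted above allow a few rough per-clip averages to be derived. The sketch below is plain arithmetic on the stated totals. It assumes 1 MB = 1,000,000 bytes, which the description does not specify, so the bitrate is only indicative.

```python
# Totals quoted in the dataset description.
NUM_CLIPS = 1127
NUM_SESSIONS = 19
DURATION_S = 3 * 3600 + 20 * 60 + 47  # 3 h 20 min 47 s
AUDIO_MB = 184                        # approximate, as stated

# Derived averages.
mean_clip_s = DURATION_S / NUM_CLIPS
mean_clips_per_session = NUM_CLIPS / NUM_SESSIONS
# Assumption: decimal megabytes (1 MB = 1,000,000 bytes).
mean_bitrate_kbps = AUDIO_MB * 1_000_000 * 8 / DURATION_S / 1000

print(f"total duration:         {DURATION_S} s")
print(f"mean clip length:       {mean_clip_s:.2f} s")
print(f"mean clips per session: {mean_clips_per_session:.1f}")
print(f"mean audio bitrate:     {mean_bitrate_kbps:.0f} kbit/s (approx.)")
```

On these assumptions the mean clip is a little under 11 seconds long. Individual clips and sessions will vary around these averages.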
| Audio filename | Sentence |
|---|---|
| fa65400d13683d3fefd43466dfc6e0a9.mp3 | Dargaggeessa Oromoo mana hidhaa taa'ee baratee amma saayintistii addunyaa ta'e- Imala Dr. Jiinenus Fiqaaduu |
| 6feef68e306b45f8c242edd98e0ceb5a.mp3 | 'Ulfinni, guddinni fi kabaji keenya kan ittiin mirkanaa'u gonkumaa kufuu dhabuu keenyaan osoo hin taane yeroo kufaatiin nu mudatu hunda ka'uuf tattaaffii taasisnuun argama,'' falaasama jedhuun hoogganama. |
| 087ebe5361bfa6d2b1e0e1fe5c61820c.mp3 | Jechi kun kan nama falaasamaa beekamaa Chaayinaa Konfiyuushas. |
| b4032abb5a3d6516667c67034b7570af.mp3 | Haleellaa Israa'el gamoo petrokeemikaalaa Iraan irratti raawwateen namoonni shan ajjeefamuu miidiyaan Iraan gabaase |
| 8cf7bd0ce2376b68db28fd90078eebe0.mp3 | Miidiyaaleen mootummaa Iraan haleellaa dhaabbata petrokeemikaalaa Iraan irratti raawwatameen namoonni shan du'uusaanii gabaasaa jiru. |
| b064ac8eefdd10fd39c988aea7288813.mp3 | Tiraamp loltuu akka hin bobbaafne hime, Jappaan 'nu gargaaruu qabdi' jedhe |
| ecb7b9b69a3122c29af9275ae5fd8377.mp3 | Kabbadaa Badhaasee abbaa isaa obbo Badhaasee Irkoo fi haadhaa isaa aaddee Maammee Tulluu irraa bara 1984 A.L.O tti caamsaa 23 |
| 2cdc85929db39fd1840f33fcc466234e.mp3 | godina Shawaa Lixaa Aanaa Xuqur Incinnii Ganda Qonnaan bulaa Naannoo Jidduutti dhalate. |
| 9043b878f2cdc5ab2aa44f6f8db6f5bb.mp3 | Pireezidantiin US Donaald Tiraamp waraannisaanii haleellaa Iraan keessatti gageessuun karooraa olitti milkaahaa jiraachuu ibse. |
| bc9d0f4acc843965e577378774a8910c.mp3 | Israa'el ammoo dhaabbatichi kan Iraan misaa'ela balistikii isheef "qaama murteessaa" ta'e kan ittiin omishtuudha jetteetti. |
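
The mapping files described above can be read with the Python standard library. The following is a minimal sketch under stated assumptions: each session directory holds a mapping.tsv next to its MP3 files, each file starts with a header row naming the four fields (audio_filename, key, sentence, attempts), and the function names and directory layout are hypothetical. Quote handling is switched off because the sentences contain literal quotation marks.

```python
import csv
from pathlib import Path

# Field names as listed in the dataset description.
EXPECTED_FIELDS = ["audio_filename", "key", "sentence", "attempts"]


def load_mapping(tsv_path):
    """Read one mapping.tsv into a list of dicts (assumes a header row)."""
    with open(tsv_path, newline="", encoding="utf-8") as f:
        # QUOTE_NONE keeps quotation marks inside sentences as literal text.
        reader = csv.DictReader(f, delimiter="\t", quoting=csv.QUOTE_NONE)
        missing = [c for c in EXPECTED_FIELDS if c not in (reader.fieldnames or [])]
        if missing:
            raise ValueError(f"{tsv_path}: missing columns {missing}")
        return list(reader)


def load_dataset(root):
    """Collect audio/sentence pairs from every mapping.tsv under root."""
    pairs = []
    for tsv_path in sorted(Path(root).rglob("mapping.tsv")):
        for row in load_mapping(tsv_path):
            # Assumption: audio files sit in the same directory as mapping.tsv.
            row["audio_path"] = str(tsv_path.parent / row["audio_filename"])
            pairs.append(row)
    return pairs


if __name__ == "__main__":
    pairs = load_dataset("path/to/dataset")  # hypothetical local path
    print(f"loaded {len(pairs)} audio/sentence pairs")
```

If the files match the description, the loader should report 1,127 pairs across the 19 sessions. A different count would suggest that the assumed layout or header does not match the actual files.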