Afaan Oromoo-TTS-Dataset

Language

Afaan Oromo (ISO 639-3: orm) is a Cushitic language of the Afroasiatic family and the most widely spoken language in Ethiopia, with an estimated 40 to 50 million speakers worldwide. It is the official working language of the Oromia Regional State and one of the five federal working languages of Ethiopia. Afaan Oromo belongs to the Eastern Cushitic branch and is typologically characterised by agglutinative morphology, verb-final word order (SOV), and a system of grammatical gender (masculine and feminine). The language exhibits phonemic vowel length distinctions, a glottal stop phoneme, and a pitch-accent system.

Variants

The recordings in this dataset were produced by a single speaker of Afaan Oromo from Ethiopia. The variety recorded is representative of the standard written and broadcast norm of Afaan Oromo as used in Oromia, which underpins the Qubee orthographic standard. No sub-variety distinctions are encoded in the dataset.

Alphabet

The orthography used in the transcription of the audio recordings follows the Qubee alphabet, the standardised Latin-based writing system officially adopted for Afaan Oromo in 1991 by the Oromia regional government. Qubee replaced the Ethiopic (Ge'ez) script, which had been used for Oromo in some contexts, and is now the universally used writing system for the language in Ethiopia and across the Oromo diaspora.

Qubee is built on the Latin alphabet and augmented with a small number of conventions to represent sounds specific to Afaan Oromo. Vowel length, which is phonemically contrastive in the language, is indicated by doubling the vowel letter (e.g., a vs. aa, e vs. ee, i vs. ii, o vs. oo, u vs. uu). The glottal stop, a phonemically distinctive consonant in Afaan Oromo, is represented by an apostrophe or right single quotation mark (ʼ or '). The digraph dh represents the voiced alveolar (often retroflex) implosive stop, and ph represents the bilabial ejective stop. The result is an orthographic system that is accessible to Oromo literacy learners while remaining phonologically faithful to the spoken language, so the transcriptions in this dataset correspond closely to the Afaan Oromo heard in the recordings.
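The conventions described above (doubled vowels, an apostrophe for the glottal stop, digraph consonants) can be illustrated with a minimal Python sketch that splits a word into orthographic units. It is not part of the dataset tooling. The function names are hypothetical, the set of apostrophe variants is an assumption, and the digraphs ch, ny and sh are included on the assumption that they follow standard Qubee usage; only dh and ph are named in this section.

```python
import re

# Apostrophe-like characters that may stand for the glottal stop
# (assumed set: U+02BC, U+2019 and the ASCII apostrophe).
APOSTROPHES = "\u02bc\u2019'"

# One orthographic unit per match. Longer alternatives come first so that
# "aa" is not split into "a" + "a" and "dh" is not split into "d" + "h".
_UNIT = re.compile(
    r"aa|ee|ii|oo|uu"      # long vowels, written by doubling the letter
    r"|ch|dh|ny|ph|sh"     # digraph consonants (ch, ny, sh are assumptions)
    r"|[a-z']"             # any other single letter, or the glottal stop
)


def normalise_glottal(text: str) -> str:
    """Map every apostrophe variant to the ASCII apostrophe."""
    return re.sub(f"[{APOSTROPHES}]", "'", text)


def segment(word: str) -> list[str]:
    """Split a single word into Qubee units (lower-cased)."""
    return _UNIT.findall(normalise_glottal(word).lower())


print(segment("taa'ee"))   # ['t', 'aa', "'", 'ee']
print(segment("hidhaa"))   # ['h', 'i', 'dh', 'aa']
```

The sketch is purely orthographic: it does not handle consonant gemination as a unit, and it would mis-segment any word in which the two letters of a digraph happen to belong to separate consonants.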

Source

This dataset was compiled as part of a Text-to-Speech data collection initiative for Afaan Oromo. The textual source material consists of news reports and biographical narratives taken from publicly available Afaan Oromo media. The speaker read prepared passages from these texts, providing naturalistic and culturally grounded speech data. Recordings were made across 24 sessions and subsequently curated, deduplicated, and aligned with their corresponding transcriptions.

Domain

The dataset consists of prompted read speech in Afaan Oromo. The textual source material derives from contemporary news reports and biographical narratives, a register that is representative of standard written and broadcast Afaan Oromo. The recordings vary in sentence length, clause structure, and register, which gives them prosodic and lexical diversity and makes them suitable for TTS model training and ASR evaluation.

Size

Total size of MP3 audio: approx. 266 MB
Total size of TSV mapping files: approx. 392 KB
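As a rough sanity check, the audio size can be combined with the clip count and total duration reported under Structure (1,737 clips, 4 hours, 51 minutes and 7 seconds) to derive a mean clip length and an approximate average bitrate. The short Python calculation below is only a sketch. It assumes that "MB" means decimal megabytes (10^6 bytes), and the derived figures are estimates rather than values stated by the dataset.

```python
# Figures reported in this document.
total_seconds = 4 * 3600 + 51 * 60 + 7   # 4 h 51 min 7 s = 17467 s
clips = 1737
audio_bytes = 266 * 10**6                # assumption: decimal MB, approximate

# Derived estimates.
mean_clip_seconds = total_seconds / clips
avg_bitrate_kbps = audio_bytes * 8 / total_seconds / 1000

print(f"mean clip length: {mean_clip_seconds:.2f} s")     # about 10.06 s
print(f"average bitrate:  {avg_bitrate_kbps:.0f} kbit/s")  # about 122 kbit/s
```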

Structure

This dataset comprises audio clips and audio/text mapping files organised across 24 recording sessions. There are 1,737 audio clips in MP3 format, totalling 4 hours, 51 minutes and 7 seconds of speech. The dataset includes 24 audio/text mapping files (mapping.tsv), each containing aligned audio/sentence pairs for the corresponding session, with 1,737 aligned pairs in total. Each TSV file contains the following fields: audio_filename, key, sentence, attempts.
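A minimal Python sketch of how the per-session mapping files might be read into one list of records is given below. Several details are assumptions rather than documented facts: that each mapping.tsv sits in its own session directory, that it is UTF-8 encoded, and that its first row is a header with the four field names listed above. The function name load_mappings and the added session key are hypothetical.

```python
import csv
from pathlib import Path

# Field names as listed in this document.
FIELDS = ["audio_filename", "key", "sentence", "attempts"]


def load_mappings(root):
    """Collect rows from every mapping.tsv found below `root`.

    Assumes one mapping.tsv per session directory, each with a header row.
    """
    rows = []
    for tsv_path in sorted(Path(root).rglob("mapping.tsv")):
        with tsv_path.open(encoding="utf-8", newline="") as fh:
            # QUOTE_NONE: sentences contain literal quotation marks that
            # must not be interpreted as CSV quoting.
            reader = csv.DictReader(fh, delimiter="\t", quoting=csv.QUOTE_NONE)
            missing = set(FIELDS) - set(reader.fieldnames or [])
            if missing:
                raise ValueError(f"{tsv_path}: missing columns {sorted(missing)}")
            for row in reader:
                # Record which session directory the row came from.
                row["session"] = tsv_path.parent.name
                rows.append(row)
    return rows
```

If the assumptions hold, calling load_mappings on the dataset root should yield 1,737 rows, one per aligned audio/sentence pair.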

Sample

Audio filename	Sentence
fa65400d13683d3fefd43466dfc6e0a9.mp3	Dargaggeessa Oromoo mana hidhaa taa'ee baratee amma saayintistii addunyaa ta'e- Imala Dr. Jiinenus Fiqaaduu
6feef68e306b45f8c242edd98e0ceb5a.mp3	'Ulfinni, guddinni fi kabaji keenya kan ittiin mirkanaa'u gonkumaa kufuu dhabuu keenyaan osoo hin taane yeroo kufaatiin nu mudatu hunda ka'uuf tattaaffii taasisnuun argama,'' falaasama jedhuun hoogganama.
087ebe5361bfa6d2b1e0e1fe5c61820c.mp3	Jechi kun kan nama falaasamaa beekamaa Chaayinaa Konfiyuushas.
b4032abb5a3d6516667c67034b7570af.mp3	Haleellaa Israa'el gamoo petrokeemikaalaa Iraan irratti raawwateen namoonni shan ajjeefamuu miidiyaan Iraan gabaase
8cf7bd0ce2376b68db28fd90078eebe0.mp3	Miidiyaaleen mootummaa Iraan haleellaa dhaabbata petrokeemikaalaa Iraan irratti raawwatameen namoonni shan du'uusaanii gabaasaa jiru.
b064ac8eefdd10fd39c988aea7288813.mp3	Tiraamp loltuu akka hin bobbaafne hime, Jappaan 'nu gargaaruu qabdi' jedhe
ecb7b9b69a3122c29af9275ae5fd8377.mp3	Kabbadaa Badhaasee abbaa isaa obbo Badhaasee Irkoo fi haadhaa isaa aaddee Maammee Tulluu irraa bara 1984 A.L.O tti caamsaa 23
2cdc85929db39fd1840f33fcc466234e.mp3	godina Shawaa Lixaa Aanaa Xuqur Incinnii Ganda Qonnaan bulaa Naannoo Jidduutti dhalate.
9043b878f2cdc5ab2aa44f6f8db6f5bb.mp3	Pireezidantiin US Donaald Tiraamp waraannisaanii haleellaa Iraan keessatti gageessuun karooraa olitti milkaahaa jiraachuu ibse.
bc9d0f4acc843965e577378774a8910c.mp3	Israa'el ammoo dhaabbatichi kan Iraan misaa'ela balistikii isheef "qaama murteessaa" ta'e kan ittiin omishtuudha jetteetti.
