Sample-Ghomala-TTS-Dataset

Description

Sample-Ghomala-TTS-Dataset is a scripted speech dataset dedicated to the documentation and technological development of Ghomala (ISO 639-3: bbj), a Grassfields Bantu language spoken in the West Region of Cameroon. It was compiled as part of the Mozilla Data Collective initiative (2026). The dataset comprises 997 high-quality audio recordings (MP3 format) of Ghomala sentences read by a native speaker across 10 recording sessions, together with per-session sentence-to-audio mapping files that allow precise alignment between the text and the audio. Sentences were drawn from a scripted speech prompt list and read in a controlled environment.

All sentences are transcribed in the General Alphabet of Cameroon's Languages (AGLC; French: Alphabet Général des Langues du Cameroun), the reference standard for Cameroonian national languages. The Ghomala orthography used in this dataset has an extended vowel inventory. It includes the schwa ə (mid central unrounded vowel), the open-mid front unrounded vowel ɛ, the open-mid back rounded vowel ɔ, and the high central rounded vowel ʉ (barred u), as well as the digraph aə, which functions as a distinct complex vowel grapheme in Ghomala roots and affixes.

The consonant inventory is equally rich. It comprises labio-velar stops (kp), labiodental obstruents (pf, bv), a velar fricative digraph (gh), postalveolar digraphs (sh, zh), affricates (c, ts, dz), labialised consonants (gw, gwy, cw), palatalised consonants (cy, shyə), and an extensive series of prenasalised and nasal-onset clusters (m+C and n+C, yielding forms such as mh, mn, mk, mgh, mf, mc, ms, mt, mzh, mj, mw, nt, nk, nw, nj, ny).

Tone is marked with a multi-register system that combines level diacritics (acute, grave) and contour diacritics (caron, circumflex), applied to all nine vowels and to syllabic nasals. The apostrophe (', U+0027) marks glottal closure.

Because AGLC-transcribed text is available in parallel with aligned speech, the dataset is suitable for a wide range of applications, including text-to-speech (TTS) synthesis, automatic speech recognition (ASR), forced alignment, pronunciation modelling, and language-learning tools. It also directly supports efforts to standardise and normalise the digital representation of Ghomala in language technology contexts.
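Since the tone diacritics attach to letters such as ə, ɛ, ɔ and ʉ, most of which have no precomposed accented form in Unicode, text processing may benefit from an explicit normalisation step. The following Python sketch is one possible approach, not part of the dataset itself. It applies Unicode NFC and reports which of the four tone marks named above occur in a string. The function names `normalise` and `tone_marks` are hypothetical, and the list of apostrophe look-alikes folded to U+0027 is an assumption about what typed text might contain, not something the dataset documents.

```python
import unicodedata

# Combining marks for the four tone diacritics named in the description.
TONE_MARKS = {
    "\u0301": "acute",
    "\u0300": "grave",
    "\u030C": "caron",
    "\u0302": "circumflex",
}

# Assumption: look-alike characters that could stand in for U+0027.
APOSTROPHE_VARIANTS = ("\u2018", "\u2019", "\u02BC")


def normalise(text: str) -> str:
    """Fold apostrophe look-alikes to U+0027, then compose to NFC."""
    for variant in APOSTROPHE_VARIANTS:
        text = text.replace(variant, "'")
    return unicodedata.normalize("NFC", text)


def tone_marks(text: str) -> list[tuple[str, str]]:
    """Return (base character, tone name) pairs in reading order."""
    pairs = []
    base = ""
    # NFD splits every accented letter into base + combining marks.
    for ch in unicodedata.normalize("NFD", text):
        if ch in TONE_MARKS:
            pairs.append((base, TONE_MARKS[ch]))
        elif not unicodedata.combining(ch):
            base = ch
    return pairs
```

Applied to a sentence from the sample, for instance `tone_marks("Pfaə̌ byâtà cʉ́m bɛ́")`, the function lists each tone-bearing letter with its diacritic, which can serve as a quick check that a transcription uses only the expected marks.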

audio file	sentence (Ghomala, AGLC)
f8e8e61ab4a258ade7423cecf7070d71.mp3	Wɔ́kpə wɛ́ gɔ ghɔ́ lá'
bc738a28dd4e04f3197fa3f52974a5a3.mp3	Pfaə̌ byâtà cʉ́m bɛ́
0a6b821936241f07e629230ecf7a5503.mp3	zhi'tə guŋ á Jo
48f9f388280da7e4d27bef3894aa850f.mp3	Gɛ̀là'tə̀ ŋwak səkú
aa4e0a61749423f02ac7260f5442edfa.mp3	Á wə cə́ŋ ghə́ ghɔm bǐ lɔ́yà.
453ccff43fc44163c177d6536ac00028.mp3	Zhʉ́zhʉ̂m bə́ kə́ ?
f773c536b8c53caffab688e8860c8ea1.mp3	Mghɛ̌və́ wə́ shimnyə pǒm səku mfʉ̌ puá bɔ'ɔ nə́ kam fa'
ca3876772fc304b6662768d2dfc9afc7.mp3	Pə́ ǒ gɔ kwipnyə yəŋ ma
2282e5caff7d5de5f882a5edc98abfde.mp3	Tə́ da'gaə́ é nɔ̂k é bə́ tə́ ghəm, bə́ á gɔ cʉ' pə́ ywə dyɛ'
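A mapping laid out like the sample above could be read with Python's standard `csv` module. This is a minimal sketch under stated assumptions: the rows are tab-separated, and the column headers match the sample ("audio file" and "sentence (Ghomala, AGLC)"). The actual mapping files may use different headers or extra columns. The helper name `read_mapping` is hypothetical.

```python
import csv
import io


def read_mapping(handle) -> dict[str, str]:
    """Map each audio file name to its sentence, from tab-separated rows."""
    # QUOTE_NONE keeps quote characters in the sentences as literal text.
    reader = csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    # Assumption: header names are copied from the sample table.
    return {row["audio file"]: row["sentence (Ghomala, AGLC)"] for row in reader}


# Two rows copied from the sample, used here in place of a real file.
sample = (
    "audio file\tsentence (Ghomala, AGLC)\n"
    "f8e8e61ab4a258ade7423cecf7070d71.mp3\tWɔ́kpə wɛ́ gɔ ghɔ́ lá'\n"
    "0a6b821936241f07e629230ecf7a5503.mp3\tzhi'tə guŋ á Jo\n"
)
mapping = read_mapping(io.StringIO(sample))
```

For a file on disk, the same function can be given a handle opened with `open(path, encoding="utf-8", newline="")`, where `path` points to one of the per-session mapping files.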
