Sample-Ngomba-TTS-Dataset

Description

Sample-Ngomba-TTS-Dataset is a scripted speech dataset dedicated to the documentation and technological development of Ngomba (ISO 639-3: jgo), a Grassfields Bantu language spoken in the Bamboutos Division of the West Region of Cameroon. It was compiled within the framework of the Mozilla Data Collective initiative (2026).

The dataset comprises 1033 high-quality audio recordings of Ngomba sentences read by a native speaker across 11 recording sessions (predominantly MP3 format, with 2 recordings in WAV format in session 07). Each session includes a sentence-to-audio mapping file that enables sentence-level alignment between textual and acoustic data. Sentences were drawn from a scripted speech prompt list and read in a controlled environment.

The transcription of all sentences follows the General Alphabet of Cameroon's Languages (AGLC, from the French Alphabet Général des Langues du Cameroun), the reference standard for Cameroonian national languages. The Ngomba orthography used in this dataset has an extended vowel inventory: the open-mid front unrounded vowel ɛ, the open-mid back rounded vowel ɔ, the high central rounded vowel ʉ (barred u), and the vowel ʉ̈ (barred u with diaeresis), which functions as a distinct grapheme in Ngomba. It also includes a series of labialized consonants written by appending ẅ (w with diaeresis) to the base consonant (e.g., gẅ, sẅ, cẅ, kẅ, tsẅ); a tone-marking system combining level (acute, grave) and contour (caron, circumflex) diacritics applied to vowels and syllabic nasals; and the Latin small letter saltillo (ꞌ, U+A78C) for glottal closure.

The parallel availability of AGLC-transcribed text and aligned speech makes the dataset suitable for a wide range of applications, including text-to-speech (TTS) synthesis, automatic speech recognition (ASR), forced alignment, pronunciation modelling, and language learning tools. It also directly supports efforts to standardise and normalise the digital representation of Ngomba in language technology contexts.

Language

Ngomba (ISO 639-3: jgo) is a Grassfields Bantu language belonging to the Niger-Congo phylum, classified within the Mbam-Nkam branch. It is spoken in the Bamboutos Division of the West Region of Cameroon, where it belongs to the Central-Bamileke cluster alongside Mengaka, Ngombale, and Ngiemboon (Breton & Bikia Fohtung 1991). Despite its sociolinguistic significance within the Bamileke cultural area of Cameroon, Ngomba remains substantially underrepresented in language technology resources.

Variants

According to the Administrative Atlas of Cameroon's Languages (Breton & Bikia Fohtung 1991), Ngomba comprises the following dialects:

Bameso
Bamendjinda
Bamete

Writing System

The writing system used for the transcription of Ngomba in this dataset is the General Alphabet of Cameroon's Languages (AGLC). The AGLC provides a phonologically motivated orthographic standard for Cameroonian national languages and serves as the reference framework for Ngomba literacy materials.

1. Vowels

The vowel system attested in the dataset includes the following oral vowels:

a, e, i, o, u, ɛ, ɔ, ʉ, ʉ̈

Where:

ɛ (epsilon): open-mid front unrounded vowel
ɔ (open-o): open-mid back rounded vowel
ʉ (barred u): high central rounded vowel
ʉ̈ (barred u with diaeresis): a distinct Ngomba vowel grapheme, frequently appearing in the sequence ʉ̈ɔ and in standalone position (e.g., kʉ̈, cʉ̈, pʉ̈)

Long vowels are represented by vowel doubling (e.g., aa, ɛɛ, ɔɔ, uu, ʉʉ).

2. Consonants

The consonant inventory reflected in the dataset includes simple, digraph, and labialized consonants:

b, c, d, f, g, h, j, k, l, m, n, p, s, sh, t, v, w, y, z, ŋ

Labialized consonants are formed by appending ẅ (w with diaeresis, U+1E85) to the base consonant or cluster:

gẅ: labialized voiced velar stop
sẅ: labialized voiceless alveolar fricative
cẅ: labialized voiceless palatal affricate
kẅ: labialized voiceless velar stop
tsẅ: labialized voiceless alveolar affricate

Special symbols:

ŋ (eng, U+014B): velar nasal consonant
sh: voiceless postalveolar fricative
ẅ (w with diaeresis, U+1E85): labialization marker, appended to consonants
ꞌ (Latin small letter saltillo, U+A78C): glottal stop / glottal closure marker
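
Since ẅ can be encoded either as the single code point U+1E85 or as w followed by a combining diaeresis (U+0308), and the saltillo (U+A78C) is easily confused with apostrophe-like characters, a consistency check on the text may be useful before training. The following is a minimal sketch, not part of the dataset tooling: the function names and the set of look-alike characters are assumptions made for illustration.

```python
import unicodedata

SALTILLO = "\uA78C"  # Latin small letter saltillo, as documented above

# Assumed look-alikes that may be typed instead of the saltillo:
# ASCII apostrophe, right single quotation mark, modifier letter apostrophe.
LOOKALIKES = {"\u0027", "\u2019", "\u02BC"}


def normalize_nfc(text: str) -> str:
    """Compose characters, so that w + U+0308 becomes the code point U+1E85."""
    return unicodedata.normalize("NFC", text)


def find_lookalikes(text: str) -> list[tuple[int, str]]:
    """Return (index, character) pairs for apostrophe-like characters."""
    return [(i, ch) for i, ch in enumerate(text) if ch in LOOKALIKES]
```

Whether a flagged character should be replaced by the saltillo is a decision for the data maintainers; the sketch only reports positions and does not rewrite the text.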

3. Syllabic nasals

Ngomba attests syllabic nasal consonants that function as tone-bearing units. The following tone-marked syllabic nasals are represented in the dataset:

ḿ (m with acute): syllabic bilabial nasal, high tone
ń (n with acute): syllabic alveolar/palatal nasal onset, high tone
ǹ (n with grave): syllabic alveolar nasal, low tone
ŋ́ (eng with acute): velar nasal onset, high tone

4. Tone system

Ngomba is a tonal language with multiple contrastive pitch levels and contour tones. The dataset employs systematic tone marking on vowels and syllabic nasals in accordance with the AGLC convention. The following diacritics are attested in the dataset:

Level tones:

High tone (H): acute accent — á, é, í, ó, ú, ɛ́, ɔ́, ʉ́
Low tone (L): grave accent — à, è, ì, ò, ù, ɛ̀, ɔ̀, ʉ̀

Contour tones:

Falling tone (HL): circumflex — â, ê, î, ô, û, ɛ̂, ɔ̂, ʉ̂
Rising tone (LH): caron — ǎ, ě, ǐ, ǒ, ǔ, ɛ̌, ɔ̌, ʉ̌

Mid tone is generally left unmarked in the Ngomba AGLC orthography.
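
For text processing it can help to separate the tone diacritics from the base letters while keeping the diaeresis of ʉ̈ and ẅ, which is part of the grapheme and not a tone mark. The sketch below assumes the text uses the standard Unicode combining marks (U+0301 acute, U+0300 grave, U+0302 circumflex, U+030C caron); the function names are illustrative and not part of the dataset.

```python
import unicodedata

# Combining marks mapped to the tone labels used in this section.
TONE_MARKS = {
    "\u0301": "H",   # combining acute accent
    "\u0300": "L",   # combining grave accent
    "\u0302": "HL",  # combining circumflex accent
    "\u030C": "LH",  # combining caron
}


def strip_tones(text: str) -> str:
    """Remove tone diacritics, keeping other marks such as the diaeresis."""
    decomposed = unicodedata.normalize("NFD", text)
    kept = "".join(ch for ch in decomposed if ch not in TONE_MARKS)
    return unicodedata.normalize("NFC", kept)


def tone_sequence(text: str) -> list[str]:
    """List the marked tones in order of appearance; unmarked vowels are skipped."""
    decomposed = unicodedata.normalize("NFD", text)
    return [TONE_MARKS[ch] for ch in decomposed if ch in TONE_MARKS]
```

Decomposing to NFD first means that precomposed letters such as á and letters with a separate combining mark such as ɔ́ are handled the same way.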

Source

The dataset was compiled from scripted speech prompt lists read by a native speaker. Sentences were selected to provide broad phonological coverage of Ngomba and were transcribed in accordance with the AGLC orthographic standard.

Domain

The dataset represents scripted speech in Ngomba, covering a broad range of everyday sentence types drawn from a general-purpose TTS/ASR prompt list. All utterances are scripted rather than spontaneous.

Size

Total audio duration: 3,978 seconds (01h 06m 18s), distributed across 1033 audio clips in 11 recording sessions.

Structure

The dataset is organised into 11 recording sessions:

Session tts_dataset_jgo_01: 100 clips (08m 04s)
Session tts_dataset_jgo_02: 100 clips (07m 18s)
Session tts_dataset_jgo_03: 100 clips (08m 13s)
Session tts_dataset_jgo_04: 100 clips (08m 01s)
Session tts_dataset_jgo_05: 99 clips (06m 19s)
Session tts_dataset_jgo_06: 100 clips (05m 54s)
Session tts_dataset_jgo_07: 100 clips (05m 39s)
Session tts_dataset_jgo_08: 100 clips (05m 35s)
Session tts_dataset_jgo_09: 100 clips (04m 44s)
Session tts_dataset_jgo_10: 100 clips (04m 34s)
Session tts_dataset_jgo_11: 34 clips (01m 51s)
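
As a quick consistency check, the per-session figures listed above can be summed and compared with the totals given under Size. The sketch below transcribes the clip counts and durations from the list; the names SESSIONS and fmt_hms are illustrative. The clip counts add up to the stated 1033, while the per-session durations add up to 3,972 seconds, 6 seconds short of the stated 3,978. That gap would be consistent with per-session durations having been rounded or truncated to whole seconds, though this is an assumption.

```python
# (clips, minutes, seconds) per session, transcribed from the list above.
SESSIONS = {
    "tts_dataset_jgo_01": (100, 8, 4),
    "tts_dataset_jgo_02": (100, 7, 18),
    "tts_dataset_jgo_03": (100, 8, 13),
    "tts_dataset_jgo_04": (100, 8, 1),
    "tts_dataset_jgo_05": (99, 6, 19),
    "tts_dataset_jgo_06": (100, 5, 54),
    "tts_dataset_jgo_07": (100, 5, 39),
    "tts_dataset_jgo_08": (100, 5, 35),
    "tts_dataset_jgo_09": (100, 4, 44),
    "tts_dataset_jgo_10": (100, 4, 34),
    "tts_dataset_jgo_11": (34, 1, 51),
}


def fmt_hms(total_seconds: int) -> str:
    """Format a number of seconds in the 'HHh MMm SSs' style used in this card."""
    hours, rest = divmod(total_seconds, 3600)
    minutes, seconds = divmod(rest, 60)
    return f"{hours:02d}h {minutes:02d}m {seconds:02d}s"


total_clips = sum(clips for clips, _, _ in SESSIONS.values())
total_seconds = sum(m * 60 + s for _, m, s in SESSIONS.values())

print(total_clips)             # 1033
print(total_seconds)           # 3972
print(fmt_hms(3978))           # 01h 06m 18s
```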

Each session folder contains:

Audio clips (MP3 format; session 07 additionally contains 2 WAV files)
One per-session sentence-to-audio mapping file (mapping.tsv), with 4 columns

Description of columns (mapping.tsv)

#audio_filename: filename of the audio clip (MP3 or WAV)
#key: unique hash identifier of the recording
#sentence: sentence text as read by the speaker, transcribed in AGLC orthography
#attempts: number of recording attempts before acceptance
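
A minimal sketch of reading one session's mapping file with Python's standard csv module follows. It assumes the file is tab-separated, UTF-8 encoded, and starts with a header row whose names match the four columns above (with or without the leading #); the actual header spelling has not been verified here, and parse_mapping and load_mapping are illustrative names.

```python
import csv


def parse_mapping(lines):
    """Parse tab-separated mapping rows from an iterable of text lines."""
    # QUOTE_NONE keeps quote characters in sentences as literal text.
    reader = csv.DictReader(lines, delimiter="\t", quoting=csv.QUOTE_NONE)
    rows = []
    for row in reader:
        # Drop an optional leading '#' from header names (assumption).
        clean = {k.lstrip("#"): v for k, v in row.items() if k is not None}
        clean["attempts"] = int(clean["attempts"])
        rows.append(clean)
    return rows


def load_mapping(tsv_path):
    """Read a mapping.tsv file from disk and return its rows as dicts."""
    with open(tsv_path, encoding="utf-8", newline="") as handle:
        return parse_mapping(handle)
```

For example, load_mapping("tts_dataset_jgo_01/mapping.tsv") would return one dict per clip, assuming the session folder name matches the session identifier; each dict's audio_filename can then be joined with the session folder to locate the clip.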

Sample

audio file	sentence (Ngomba, AGLC)
8ee4d1075531663ae1a33aaf4d1024c0.mp3	Má ka ŋgɔ́ njʉ̈ɔ́ mɔ́ɔ mɛtap pɔ ku wɛ
b4281986fcc6921aa360078b8136a25a.mp3	Tǎa cẅímankɔ' gá ŋ́kap yaa nɛ́mɔ
e9b2e5ce93602762b6689cda32a85efe.mp3	Ŋ gẅɛɛ nɛgʉ tʉ́sɔn ŋkɔɔnɛ
a591642f8e41460b6cf4632b359d862f.mp3	Ɛ píkŋɛ nɛ́ jʉ̈ɔ́ pʉ̈ɔ pɛ́nɛ́túgɛ́
454833864f3c4e15de1709f1f8ff9254.mp3	Pɔ́p ká kwɛ́tyúu pɔ́ tɛ fú mbaꞌámbaꞌá