Ghomala-MultimodalDataset

Description

Ghomala-MultimodalDataset is a curated multimodal linguistic dataset dedicated to the documentation and technological enhancement of the ɣɔmáláʔ language (Ghɔmala'), as documented in the Bandjoun and Bamougoum villages of the West Region of Cameroon. ɣɔmáláʔ is a Grassfields Bantu language of the Mbam-Nkam branch of the Bantoid family and is rarely represented in existing computational resources. The dataset was compiled in the context of doctoral research on the forms and functions of ritual language in the ɣɔmáláʔ-speaking community (2020-2021).

The dataset comprises three closely aligned components: (i) a structured fieldwork datasheet containing 376 IPA-transcribed example sentences extracted from recorded ritual speech events, together with their word-for-word parsing, interlinear glosses and French translations; (ii) 369 audio recordings of these sentences, produced by a native speaker of ɣɔmáláʔ across four recording sessions; and (iii) per-session audio–sentence mapping files that align the textual and acoustic data. The dataset additionally includes a bilingual parallel corpus (Ghomala–French) in TSV format, derived from the same source material.

The ritual texts originate from five distinct ceremonial contexts documented in the Bandjoun and Bamougoum speech communities: rites of intercession for healing, goat sacrifice rituals, dowry ceremonies, purification rites, and installation rites. This breadth of ritual registers makes the dataset particularly valuable for studying specialised and formulaic language use in a tonal Grassfields Bantu language.

From a methodological perspective, the dataset bridges language documentation and language technology. The parallel availability of IPA-transcribed text in ɣɔmáláʔ and French, alongside aligned speech, makes it suitable for automatic speech recognition (ASR), text-to-speech (TTS), machine translation (MT), forced alignment, pronunciation modelling and multimodal language learning tools. The structured datasheet, with its interlinear glosses and word-level parsing, additionally supports linguistic analysis, contrastive studies with other Grassfields Bantu varieties, and pedagogical uses in teacher training and language revitalisation contexts.

The phonological inventory documented in this dataset (including a complex tonal system, ejective consonants, nasal vowels and vowel harmony) reflects the structural richness of ɣɔmáláʔ and contributes to a more inclusive and granular representation of African linguistic diversity in language technology resources.

Language

ɣɔmáláʔ (ghɔmala') is a Grassfields Bantu language of the Bamiléké-central group, belonging to the Mbam-Nkam branch of the Bantoid family. According to the Administrative Atlas of Cameroon (Breton and Bikia Fohtung 1991), it is spoken predominantly in the Mifi Division in the West Region of Cameroon, and is internally divided into four dialect sub-areas. This dataset primarily represents the ghɔmala-central sub-area, specifically the jo variety (Bandjoun).

Variants

According to Breton and Bikia Fohtung (1991), the ɣɔmáláʔ (ghɔmala') language comprises four dialectal clusters:

Northern Ghɔmala: fʉ'sap (Bafoussam) dialect and laŋ (Baleng) dialect
Ngemba (Western Ghɔmala): mugum (Bamugum) dialect, meka (Bameka) dialect and mɔnjɔ (Bamenju) dialect
Central Ghɔmala: jo (Bandjoun) dialect, we (Bahuan) dialect, hɔm (Baham) dialect and yogam (Bayangam) dialect
Southern Ghɔmala: tɛ' (Batiɛ) dialect, pa (Bapa) dialect and denkwop (Badenkop) dialect

The present dataset primarily represents the Central Ghɔmala sub-area. The dominant variety is the jo variety of Bandjoun, with additional material drawn from the we variety of Bahuan (Bamougoum). These two varieties constitute the speech communities investigated in the doctoral research from which this dataset originates.

Writing System

The writing system used for the transcription of ɣɔmáláʔ in this dataset is the International Phonetic Alphabet (IPA), as reflected in the sentence and sentence_parsed columns of the fieldwork datasheet.

1. Vowels

The vowel system attested in the dataset includes oral and nasal vowels with tonal marking:

Oral vowels: a, ā, á, â, ǎ, e, ə, ə̄, ɔ, ɔ̄, ɔ́, ɔ̂, ɔ̌, o, ō, ó, ô, ǒ, u, ū, ú, û, ǔ, i, ī, ɛ, ɛ́, ʉ, ʉ́, ʉ̌

Features: vowel harmony, nasal vowels, ejective variants on consonants preceding vowels

2. Consonants

The consonant inventory reflected in the dataset includes:

b, bv, d, dz, dʒ, f, g, gʉ, h, j, k, kh, l, m, mb, n, nd, ng, ŋ, p, pj, r, s, sh, t, tf, tʃ, ts, v, w, z, ʒ, ᴓ (voiced bilabial fricative), ʔ (glottal stop)

Prenasalised and palatalised consonants are attested throughout the corpus. The symbol ᴓ appears consistently across ritual utterances.
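
Because several consonants in the list above are written with two letters (for example tʃ, dz, mb), splitting a transcription character by character would break them apart. The following is a minimal sketch of a greedy longest-match segmenter, assuming that every multi-letter entry in the list is meant as a single unit; the function name segment is illustrative and not part of the dataset.

```python
import unicodedata

# Consonant symbols copied from the inventory listed above.
CONSONANTS = [
    "b", "bv", "d", "dz", "dʒ", "f", "g", "gʉ", "h", "j", "k", "kh", "l", "m",
    "mb", "n", "nd", "ng", "ŋ", "p", "pj", "r", "s", "sh", "t", "tf", "tʃ",
    "ts", "v", "w", "z", "ʒ", "ᴓ", "ʔ",
]
# Try longer symbols first so that "tʃ" is preferred over "t".
BY_LENGTH = sorted(CONSONANTS, key=len, reverse=True)

def segment(word):
    """Split a transcribed word into consonant and vowel segments."""
    # NFD separates base letters from combining tone marks.
    chars = unicodedata.normalize("NFD", word)
    segments, i = [], 0
    while i < len(chars):
        match = next((c for c in BY_LENGTH if chars.startswith(c, i)), None)
        if match is None:
            match = chars[i]  # a vowel or any symbol outside the list
        j = i + len(match)
        # Keep combining diacritics attached to the preceding symbol.
        while j < len(chars) and unicodedata.combining(chars[j]):
            j += 1
        segments.append(chars[i:j])
        i = j
    return segments

print(segment("tʃjə̄pɔ̄"))
```

The segments are returned in decomposed (NFD) form, so they should be normalised in the same way before being compared with other strings.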

3. Tone system

Tone is lexically and grammatically contrastive in the dataset and is marked directly on vowels:

High tone (H): á, ə́, ɔ́, ó, ú
Low tone (L): unmarked in most forms
Falling tone (HL): â, ə̂, ɔ̂, ô, û
Rising tone (LH): ǎ, ə̌, ɔ̌, ǒ, ǔ
Mid / level tone: ā, ə̄, ɔ̄, ō, ū
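
Since tone is carried by diacritics, a tone pattern can be read off a transcription by decomposing each vowel into its base letter and combining mark. The sketch below assumes the mapping given in the list above (acute = H, circumflex = HL, caron = LH, macron = M) and treats every unmarked vowel as low, which the list says holds in most forms only. The function name tone_sequence is illustrative.

```python
import unicodedata

# Combining diacritics used for tone in the list above.
TONE_MARKS = {
    "\u0301": "H",   # acute: high
    "\u0302": "HL",  # circumflex: falling
    "\u030c": "LH",  # caron: rising
    "\u0304": "M",   # macron: mid / level
}
# Base vowel letters from the vowel list: a e i o u ə ɔ ɛ ʉ
VOWELS = set("aeiou\u0259\u0254\u025b\u0289")

def tone_sequence(text):
    """Return one tone label per vowel, in order of appearance."""
    chars = unicodedata.normalize("NFD", text.lower())
    tones = []
    i = 0
    while i < len(chars):
        if chars[i] in VOWELS:
            label = "L"  # assumption: an unmarked vowel carries low tone
            j = i + 1
            # Inspect the combining marks that follow the base vowel.
            while j < len(chars) and unicodedata.combining(chars[j]):
                label = TONE_MARKS.get(chars[j], label)
                j += 1
            tones.append(label)
            i = j
        else:
            i += 1
    return tones

print(tone_sequence("tʃjə̄pɔ̄ támdzə̄."))
```

For the sample sentence tʃjə̄pɔ̄ támdzə̄. this yields the labels M, M, H, M. Normalising to NFD first means that precomposed letters such as á and decomposed sequences such as ə̄ are handled in the same way.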

Source

The dataset was compiled from video and audio recordings of ritual speech events collected through fieldwork in the villages of Bandjoun and Bamougoum (West Region, Cameroon), as part of doctoral research on pragmalinguistic aspects of ritual communication in the ɣɔmáláʔ-speaking community (Université de Yaoundé I, 2025).

Domain

The dataset represents authentic ritual speech in ɣɔmáláʔ, covering five ceremonial registers: intercession rites for healing, goat sacrifice, dowry, purification, and chiefly installation. All utterances are drawn from naturally occurring ritual discourse rather than elicited speech.

Size

Total audio duration: 1,057 seconds (00:17:37), distributed across 369 MP3 audio clips in 4 recording sessions. The fieldwork datasheet contains 376 rows. Total uncompressed dataset size: approximately 25 MB.

Structure

The dataset comprises:

A fieldwork datasheet (Fieldwork-Dataset_Ghomala.tsv) with 376 rows and 7 columns;
369 MP3 audio clips read by a single native speaker of ɣɔmáláʔ (Bandjoun variety), with a total duration of 1,057 seconds (00:17:37), distributed across 4 recording sessions:
- Session 01: 91 clips (6m 34s)
- Session 02: 100 clips (5m 00s)
- Session 03: 94 clips (6m 29s)
- Session 04: 84 clips (4m 03s)
Four per-session sentence-to-audio mapping files (mapping.tsv), each with 4 columns;
A bilingual parallel corpus Ghomala–French (Corpus_Parallele_Ghomala-Francais.xlsx and .pdf) with 376 sentence pairs.
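
The counts above can be cross-checked with a few lines of arithmetic. This is only a sketch over the figures stated in this section; it does not read the dataset files.

```python
# Clip counts per recording session, as listed above.
clips_per_session = {"01": 91, "02": 100, "03": 94, "04": 84}
total_clips = sum(clips_per_session.values())

datasheet_rows = 376   # rows in the fieldwork datasheet
total_seconds = 1057   # stated total audio duration

def hms(seconds):
    """Format a whole number of seconds as HH:MM:SS."""
    minutes, secs = divmod(seconds, 60)
    hours, minutes = divmod(minutes, 60)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}"

print(total_clips)                   # 369
print(datasheet_rows - total_clips)  # 7 more datasheet rows than audio clips
print(hms(total_seconds))            # 00:17:37
```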

Description of columns (Fieldwork-Dataset_Ghomala.tsv)

#id: sequential identifier of the sentence entry
#language: ISO 639-3 language code (bbj = ɣɔmáláʔ / Ghomala)
#sentence: sentence in ɣɔmáláʔ, transcribed in IPA
#sentence_parsed: word-for-word parsing of the sentence, elements separated by |
#gloss: interlinear grammatical gloss, elements separated by |
#translation_fr: French translation of the sentence
#source: source document from which the sentence was extracted
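
A minimal sketch of how the datasheet might be read, pairing each parsed word with its gloss by splitting both columns on |. It assumes that the header row uses the column names above (a leading # is stripped if present). The row in SAMPLE_TSV is an invented placeholder, not a sentence from the corpus, and read_datasheet is an illustrative name.

```python
import csv
import io

# Placeholder content standing in for Fieldwork-Dataset_Ghomala.tsv.
# The data row is invented for illustration and is not a corpus sentence.
SAMPLE_TSV = (
    "id\tlanguage\tsentence\tsentence_parsed\tgloss\ttranslation_fr\tsource\n"
    "1\tbbj\tw1 w2 w3\tw1|w2|w3\tG1|G2|G3\ttraduction\tsource-doc\n"
)

def read_datasheet(handle):
    """Yield one dict per row, with parsed words paired to their glosses."""
    reader = csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        # Drop a leading '#' from header names, in case the file uses one.
        row = {name.lstrip("#").strip(): value for name, value in row.items()}
        words = [w.strip() for w in row["sentence_parsed"].split("|")]
        glosses = [g.strip() for g in row["gloss"].split("|")]
        # zip() stops at the shorter list if the two columns differ in length.
        row["aligned"] = list(zip(words, glosses))
        yield row

rows = list(read_datasheet(io.StringIO(SAMPLE_TSV)))
print(rows[0]["aligned"])
```

To read the real file, the StringIO object would be replaced by open("Fieldwork-Dataset_Ghomala.tsv", encoding="utf-8", newline="").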

Description of columns (mapping.tsv)

#audio_filename: filename of the audio clip
#key: unique hash identifier of the recording
#sentence: sentence text as read by the speaker
#attempts: number of recording attempts before acceptance
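
The mapping files can be used to pair each clip with its sentence, for example when preparing ASR or TTS training data. The sketch below assumes the four column names above. The file names and sentences are taken from the sample table in this document, while the key and attempts values and the folder name session_01 are placeholders.

```python
import csv
import io
from pathlib import Path

# Two rows modelled on a per-session mapping.tsv. The key and attempts
# values are placeholders; the real values are not shown in this document.
SAMPLE_MAPPING = (
    "audio_filename\tkey\tsentence\tattempts\n"
    "0409f43403a9b0ea13c0996fff4586c9.mp3\tKEY-A\tPō ᴓ lúsí ntámdzə̄\t1\n"
    "1889a56170f94b0b3fa22215382fe01a.mp3\tKEY-B\ttʃjə̄pɔ̄ támdzə̄.\t2\n"
)

def load_mapping(handle, audio_dir):
    """Return a list of (audio_path, sentence, attempts) tuples."""
    reader = csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    pairs = []
    for row in reader:
        # Drop a leading '#' from header names, in case the file uses one.
        row = {name.lstrip("#").strip(): value for name, value in row.items()}
        audio_path = Path(audio_dir) / row["audio_filename"]
        pairs.append((audio_path, row["sentence"], int(row["attempts"])))
    return pairs

# "session_01" is a hypothetical folder name for the first recording session.
pairs = load_mapping(io.StringIO(SAMPLE_MAPPING), "session_01")
for path, sentence, attempts in pairs:
    print(path.name, attempts, sentence)
```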

Sample

audio file	sentence (ɣɔmáláʔ)
00ae644aaa7b62ca2ba8133a03b9ab0f.mp3	Gᾱ ā pə ́ bə̄ nwə̄ jə̄ŋ lə̄ pɔ́k á gᾱkə́lə́
02f38b6f19261c468c921fed8b146313.mp3	bə̄ á gᾱkə́lə́ nə̂ jə́ tsə́,
02f79816a6a683c5816fe6bb15e94aad.mp3	bə̄ pjə̄ ᴓ sɔ̄ʔ pôʔ tjə́ʔɔ̄ nə́ njāptə̄ já.
0409f43403a9b0ea13c0996fff4586c9.mp3	Pō ᴓ lúsí ntámdzə̄
068185ecdf253fd4fce5b133b23f9039.mp3	há gʉ́ʔ pə̄ nə́ ŋkwítə́ gūŋ mōnə̄ŋ āwɛ́ láʔ ɔ̄
0dd889caaac1cc8564992008c7b39900.mp3	nə̂ thə́ gūŋ pǒ Mɔ̂ʃjə̄ āwɛ́,
129ae02a0cd33a750dce2e6a1616a4a8.mp3	â tə́m ӡʉ́mӡʉ́m.
1889a56170f94b0b3fa22215382fe01a.mp3	tʃjə̄pɔ̄ támdzə̄.