License:
NOODL-1.0
Steward:
Institute of African Digital Humanities
Dataset ID:
cmq83jqac0141mk07gpu4vit2
Task: NLP
Release Date: 6/10/2026
Format: TSV, MP3
Size: 18.67 MB
Ghomala-MultimodalDataset is a curated, multimodal linguistic dataset dedicated to the documentation and technological enhancement of the ɣɔmáláʔ language (Ghɔmala'), as documented in the Bandjoun and Bamougoum villages of the West Region of Cameroon. ɣɔmáláʔ is a Grassfields Bantu language of the Mbam-Nkam branch of the Bantoid family. It is rarely represented in existing computational resources. The dataset was compiled in the context of doctoral research on the forms and functions of ritual language in the ɣɔmáláʔ-speaking community (2025).
The dataset comprises three closely aligned components: (i) a structured fieldwork datasheet containing 376 IPA-transcribed example sentences extracted from recorded ritual speech events, together with their word-for-word parsing, interlinear glosses and French translations; (ii) 369 high-quality audio recordings of these sentences, produced by a native speaker of ɣɔmáláʔ across four recording sessions; and (iii) per-session audio–sentence mapping files enabling precise alignment between the textual and acoustic data. The dataset additionally includes a bilingual parallel corpus (Ghomala–French) in TSV format, derived from the same source material.
The ritual texts captured in this dataset originate from five distinct ceremonial contexts documented in the Bandjoun and Bamougoum speech communities: rites of intercession for healing, goat sacrifice rituals, dowry ceremonies, purification rites, and installation rites. This breadth of ritual registers makes the dataset particularly valuable for studying specialised and formulaic language use in a tonal Grassfields Bantu language.
From a methodological perspective, the dataset bridges language documentation and language technology. The parallel availability of IPA-transcribed text in ɣɔmáláʔ and French, alongside aligned speech, makes it suitable for a wide range of applications, including automatic speech recognition (ASR), text-to-speech (TTS), machine translation (MT), forced alignment, pronunciation modelling and multimodal language learning tools. The structured datasheet, with its interlinear glosses and word-level parsing, additionally supports linguistic analysis, contrastive studies with other Grassfields Bantu varieties, and pedagogical uses in teacher training and language revitalisation contexts.
The phonological inventory documented in this dataset (including a complex tonal system, ejective consonants, nasal vowels and vowel harmony) reflects the structural richness of ɣɔmáláʔ, and contributes to a more inclusive and granular representation of African linguistic diversity in language technology resources.
Licensing
Nwulite Obodo Open Data Licence 1.0 (NOODL-1.0)
https://licensingafricandatasets.com/nwulite-obodo-license
Restrictions/Special Constraints
By downloading this dataset, you agree:
- To use it for research and scientific use only
- That you will not re-host or re-share this dataset
Forbidden Usage
You agree not to use the data for: determining the identity of the speaker in the dataset; attempting to clone the voice or train models that imitate the speaker in this dataset; Generative AI; reproduction; duplication; modification; augmentation; copying; distribution; transmission; display; sale; transfer; publication or creation of derivative works without the explicit permission of the legal owner of the dataset.
Intended Use
(a) Speech-related tasks:
- Automatic speech recognition (ASR): Audio–text alignment enables the evaluation of speech recognition models for ɣɔmáláʔ. Sentences are transcribed in IPA. There is currently no standardised orthography widely adopted for ɣɔmáláʔ; the General Alphabet of Cameroon's Languages (GACEL) provides a reference framework.
- Text-to-speech (TTS): The dataset contains clean sentence–audio pairs and can be used to evaluate speech synthesis models. The use of IPA transcription should be taken into account when designing TTS experiments.
- Speech–text alignment / forced alignment benchmarking: Fine-grained audio–text pairing provides ground truth for evaluating phoneme- or word-level aligners adapted to tonal and phonologically complex African languages.
(b) Translation and multilingual tasks:
- Machine translation (ɣɔmáláʔ ↔ French): The sentence-level alignment between ɣɔmáláʔ and French (376 pairs in the parallel corpus) makes this dataset suitable for evaluating translation models.
- Speech translation (speech-to-text)
(c) Linguistic and lexicographic tasks:
- Pragmatic and discourse analysis: The ritual speech domain offers rich material for the study of formulaic language, speech acts, politeness strategies and discourse organisation in a Grassfields Bantu language.
- Morphological analysis / glossed corpus studies: Interlinear glosses support computational morphology, grammar induction and ILT-based (Interlinear Language Text) modelling, particularly for Grassfields Bantu languages.
- Language documentation: The dataset contributes to the digital documentation of ɣɔmáláʔ ritual speech, a domain that remains almost entirely absent from existing computational and reference resources.
ɣɔmáláʔ (ghɔmala') is a Grassfields Bantu language of the Bamiléké-central group, belonging to the Mbam-Nkam branch of the Bantoid family. According to the Administrative Atlas of Cameroon (Breton and Bikia Fohtung 1991), it is spoken predominantly in the Mifi Division in the West Region of Cameroon, and is internally divided into four dialect sub-areas. This dataset primarily represents the ghɔmala-central sub-area, specifically the jo variety (Bandjoun).
According to Breton and Bikia Fohtung (1991), ɣɔmáláʔ (ghɔmala') comprises four dialect clusters:
Northern Ghɔmala: fʉ'sap (Bafoussam) dialect and laŋ (Baleng) dialect
Ngemba (Western Ghɔmala): mugum (Bamugum), meka (Bameka) and mɔnjɔ (Bamenju) varieties
Central Ghɔmala: jo (Bandjoun) dialect, we (Bahuan) dialect, hɔm (Baham) dialect and yogam (Bayangam) dialect
Southern Ghɔmala: tɛ' (Batiɛ) dialect, pa (Bapa) dialect and denkwop (Badenkop) dialect
The present dataset primarily represents the Central Ghɔmala sub-area. The dominant variety is the jo variety of Bandjoun, with additional material drawn from the we variety of Bahuan (Bamougoum). These two varieties constitute the speech communities investigated in the doctoral research from which this dataset originates.
The writing system used for the transcription of ɣɔmáláʔ in this dataset is the International Phonetic Alphabet (IPA), as reflected in the sentence and sentence_parsed columns of the fieldwork datasheet.
The vowel system attested in the dataset includes oral and nasal vowels with tonal marking:
Oral vowels: a, ā, á, â, ǎ, e, ə, ə̄, ɔ, ɔ̄, ɔ́, ɔ̂, ɔ̌, o, ō, ó, ô, ǒ, u, ū, ú, û, ǔ, i, ī, ɛ, ɛ́, ʉ, ʉ́, ʉ̌
Features: vowel harmony, nasal vowels, ejective variants on consonants preceding vowels
The consonant inventory reflected in the dataset includes:
b, bv, d, dz, dʒ, f, g, gʉ, h, j, k, kh, l, m, mb, n, nd, ng, ŋ, p, pj, r, s, sh, t, tf, tʃ, ts, v, w, z, ʒ, ᴓ (voiced bilabial fricative), ʔ (glottal stop)
Prenasalised and palatalised consonants are attested throughout the corpus. The symbol ᴓ appears consistently across ritual utterances.
The dataset shows lexically and grammatically contrastive tones, marked directly on vowels:
High tone (H): á, ə́, ɔ́, ó, ú
Low tone (L): (unmarked in most forms)
Falling tone (HL): â, ə̂, ɔ̂, ô, û
Rising tone (LH): ǎ, ə̌, ɔ̌, ǒ, ǔ
Mid / level tone: ā, ə̄, ɔ̄, ō, ū
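The tone marking described above can be read off programmatically by decomposing each vowel into its base letter and combining diacritic. The following is a minimal sketch, assuming that unmarked vowels are treated as low and that only the four diacritics listed above carry tone; the function name `vowel_tones` and the vowel set are illustrative choices, not part of the dataset.

```python
import unicodedata

# Combining diacritics used for tone marking in the lists above.
TONE_MARKS = {
    "\u0301": "H",   # combining acute accent
    "\u0302": "HL",  # combining circumflex accent
    "\u030C": "LH",  # combining caron
    "\u0304": "M",   # combining macron
}

def vowel_tones(text, vowels="aeiouəɔɛʉ"):
    """Return (base_vowel, tone_label) pairs for each vowel in `text`."""
    # NFD splits precomposed letters such as "á" into "a" + U+0301.
    decomposed = unicodedata.normalize("NFD", text)
    result = []
    i = 0
    while i < len(decomposed):
        ch = decomposed[i]
        if ch.lower() in vowels:
            tone = "L"  # assumption: an unmarked vowel is low
            j = i + 1
            # Scan the combining marks attached to this vowel.
            while j < len(decomposed) and unicodedata.combining(decomposed[j]):
                tone = TONE_MARKS.get(decomposed[j], tone)
                j += 1
            result.append((ch, tone))
            i = j
        else:
            i += 1
    return result

print(vowel_tones("t\u00e1m"))  # "tám": one vowel carrying an acute accent
```

Because the text is decomposed first, precomposed and decomposed spellings of the same vowel yield the same tone label.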
The dataset was compiled from video and audio recordings of ritual speech events collected through fieldwork in the villages of Bandjoun and Bamougoum (West Region, Cameroon), as part of doctoral research on pragmalinguistic aspects of ritual communication in the ɣɔmáláʔ-speaking community (Université de Yaoundé I, 2025).
The dataset represents authentic ritual speech in ɣɔmáláʔ, covering five ceremonial registers: intercession rites for healing, goat sacrifice, dowry, purification, and chiefly installation. All utterances are drawn from naturally occurring ritual discourse rather than elicited speech.
Total audio duration: 1,057 seconds (00:17:37), distributed across 369 MP3 audio clips in 4 recording sessions. The fieldwork datasheet contains 376 rows. Total uncompressed dataset size: approximately 25 MB.
The dataset comprises:
A fieldwork datasheet (Fieldwork-Dataset_Ghomala.tsv) with 376 rows and 7 columns;
369 MP3 audio clips read by a single native speaker of ɣɔmáláʔ (Bandjoun variety), with a total duration of 1,057 seconds (00:17:37), distributed across 4 recording sessions:
Session 01: 91 clips (6m 34s)
Session 02: 100 clips (5m 00s)
Session 03: 94 clips (6m 29s)
Session 04: 84 clips (4m 03s)
Four per-session sentence-to-audio mapping files (mapping.tsv), each with 4 columns;
A bilingual parallel corpus Ghomala–French (Corpus_Parallele_Ghomala-Francais.xlsx and .pdf) with 376 sentence pairs.
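The per-session figures can be tallied with a few lines of arithmetic. The sketch below uses only the numbers listed above. The clip counts sum to the stated 369. The listed session durations sum to 1,326 seconds (22m 06s), which differs from the stated total of 1,057 seconds, so durations are probably best recomputed from the audio files themselves.

```python
# Per-session figures as listed above: (clips, minutes, seconds).
sessions = {
    "Session 01": (91, 6, 34),
    "Session 02": (100, 5, 0),
    "Session 03": (94, 6, 29),
    "Session 04": (84, 4, 3),
}

total_clips = sum(clips for clips, _, _ in sessions.values())
total_seconds = sum(60 * m + s for _, m, s in sessions.values())

print(total_clips)    # 369
print(total_seconds)  # 1326
```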
Columns of the fieldwork datasheet (Fieldwork-Dataset_Ghomala.tsv):
#id: sequential identifier of the sentence entry
#language: ISO 639-3 language code (bbj = ɣɔmáláʔ / Ghomala)
#sentence: sentence in ɣɔmáláʔ, transcribed in IPA
#sentence_parsed: word-for-word parsing of the sentence, elements separated by |
#gloss: interlinear grammatical gloss, elements separated by |
#translation_fr: French translation of the sentence
#source: source document from which the sentence was extracted
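Since `sentence_parsed` and `gloss` are both separated by `|`, each parsed element can be paired with its gloss by splitting the two fields in parallel. The sketch below assumes the TSV header uses the column names above without the leading `#`, and it runs on a one-row placeholder sample rather than real data; `aligned_glosses` is a hypothetical helper name.

```python
import csv
import io

# Hypothetical one-row sample in the datasheet layout; the values are
# placeholders, not taken from the dataset.
SAMPLE_TSV = (
    "id\tlanguage\tsentence\tsentence_parsed\tgloss\ttranslation_fr\tsource\n"
    "1\tbbj\tw1 w2 w3\tw1|w2|w3\tg1|g2|g3\tphrase\tdoc\n"
)

def aligned_glosses(row):
    """Pair each parsed element with its gloss; both fields are '|'-separated."""
    words = [w.strip() for w in row["sentence_parsed"].split("|")]
    glosses = [g.strip() for g in row["gloss"].split("|")]
    if len(words) != len(glosses):
        raise ValueError(
            f"row {row['id']}: {len(words)} words vs {len(glosses)} glosses"
        )
    return list(zip(words, glosses))

# QUOTE_NONE keeps any literal quotation marks in the text untouched.
reader = csv.DictReader(
    io.StringIO(SAMPLE_TSV), delimiter="\t", quoting=csv.QUOTE_NONE
)
pairs = [aligned_glosses(row) for row in reader]
print(pairs)
```

To run this on the real datasheet, replace `io.StringIO(SAMPLE_TSV)` with `open("Fieldwork-Dataset_Ghomala.tsv", encoding="utf-8", newline="")`. Raising on a length mismatch is a deliberate choice here, so that misaligned rows are noticed rather than silently truncated by `zip`.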
Columns of each per-session mapping file (mapping.tsv):
#audio_filename: filename of the audio clip
#key: unique hash identifier of the recording
#sentence: sentence text as read by the speaker
#attempts: number of recording attempts before acceptance
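A mapping file can be turned into a sentence-to-audio lookup. The sketch below assumes the header uses the column names above without the leading `#`, and it normalises the text to NFC so that composed and decomposed diacritics compare equal. The filename and sentence are copied from this card's sample table, while the `key` and `attempts` values are placeholders; `norm` and `load_mapping` are hypothetical helper names.

```python
import csv
import io
import unicodedata

def norm(text):
    """NFC-normalise and trim so composed/decomposed diacritics compare equal."""
    return unicodedata.normalize("NFC", text).strip()

def load_mapping(handle):
    """Map normalised sentence text to audio filename from one mapping file."""
    reader = csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    return {norm(row["sentence"]): row["audio_filename"] for row in reader}

# One-row sample: filename and sentence come from the card's sample table;
# "k1" and "1" are placeholder values for `key` and `attempts`.
SENTENCE = "tʃjə̄pɔ̄ támdzə̄."
MAPPING_TSV = (
    "audio_filename\tkey\tsentence\tattempts\n"
    f"1889a56170f94b0b3fa22215382fe01a.mp3\tk1\t{SENTENCE}\t1\n"
)

mapping = load_mapping(io.StringIO(MAPPING_TSV))

# A fully decomposed spelling of the same sentence still resolves.
query = unicodedata.normalize("NFD", SENTENCE)
print(mapping.get(norm(query)))
```

Since the datasheet has 376 rows and there are 369 clips, a lookup built this way should be expected to return `None` for some datasheet sentences.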
| audio file | sentence (ɣɔmáláʔ) |
|---|---|
| 00ae644aaa7b62ca2ba8133a03b9ab0f.mp3 | Gᾱ ā pə ́ bə̄ nwə̄ jə̄ŋ lə̄ pɔ́k á gᾱkə́lə́ |
| 02f38b6f19261c468c921fed8b146313.mp3 | bə̄ á gᾱkə́lə́ nə̂ jə́ tsə́, |
| 02f79816a6a683c5816fe6bb15e94aad.mp3 | bə̄ pjə̄ ᴓ sɔ̄ʔ pôʔ tjə́ʔɔ̄ nə́ njāptə̄ já. |
| 0409f43403a9b0ea13c0996fff4586c9.mp3 | Pō ᴓ lúsí ntámdzə̄ |
| 068185ecdf253fd4fce5b133b23f9039.mp3 | há gʉ́ʔ pə̄ nə́ ŋkwítə́ gūŋ mōnə̄ŋ āwɛ́ láʔ ɔ̄ |
| 0dd889caaac1cc8564992008c7b39900.mp3 | nə̂ thə́ gūŋ pǒ Mɔ̂ʃjə̄ āwɛ́, |
| 129ae02a0cd33a750dce2e6a1616a4a8.mp3 | â tə́m ӡʉ́mӡʉ́m. |
| 1889a56170f94b0b3fa22215382fe01a.mp3 | tʃjə̄pɔ̄ támdzə̄. |