License:
NOODL-1.0
Steward:
Institute of African Digital Humanities
Dataset ID:
cmpmpdwwh0217nu07zomtwz7o
Task: NLP
Release Date: 5/26/2026
Format: MP3, TSV
Size: 8.44 MB
Ngemba_ALCAM-MultimodalDataset is a richly curated, multimodal linguistic dataset dedicated to the documentation and technological enhancement of the Ngemba language, also referred to in the literature as Ghomala-Ouest (Breton and Bikia Fohtung 1991). Ngemba is a Grassfields Bantu language spoken in the West Region of Cameroon and is rarely represented in existing standard grammatical descriptions, computational resources or lexicographical tools. The dataset comprises three closely aligned components: (i) a structured datasheet containing carefully selected example sentences and lexical entries reflecting attested usage in Ngemba; (ii) high-quality audio recordings of these entries, produced by a native speaker; and (iii) an explicit audio–sentence mapping file enabling precise alignment between the textual and acoustic data.
The dataset's primary added value lies in its explicit focus on Ngemba, a language that, like many other Grassfields Bantu languages, remains virtually absent from reference grammars, dictionaries, educational materials and language technology resources. The dataset captures a range of phonological and morphosyntactic features characteristic of Ngemba, including a complex system of vowel harmony, nasal vowels, ejective consonants and lexical tone, all of which are essential for understanding the language's structural specificity and are rarely documented in machine-readable form. In this sense, the dataset contributes to a more inclusive and granular representation of African linguistic diversity.
From a methodological perspective, the dataset is designed to bridge the gap between language documentation and language technology. The parallel availability of text in Ngemba and in French, alongside aligned speech, makes the dataset suitable for a wide range of applications, including automatic speech recognition (ASR), text-to-speech (TTS), machine translation (MT), forced alignment, pronunciation modelling and multimodal language learning tools. At the same time, the structured datasheet supports linguistic analysis, contrastive studies with other Grassfields Bantu varieties and pedagogical uses in teacher training and language revitalisation contexts. More broadly, the Ngemba_ALCAM-MultimodalDataset exemplifies an approach to African language resources that highlights fluidity, orality, phonological richness and community-based linguistic practice.
Licensing
Nwulite Obodo Open Data Licence 1.0 (NOODL-1.0)
https://licensingafricandatasets.com/nwulite-obodo-license
Restrictions/Special Constraints
By downloading this dataset, you agree:
- to use it for research and scientific purposes only;
- not to re-host or re-share this dataset.
Forbidden Usage
You agree not to use the data for: determining the identity of the speaker in the dataset; attempting to clone the speaker's voice or training models that imitate the speaker in this dataset; generative AI; reproduction; duplication; modification; augmentation; copying; distribution; transmission; display; sale; transfer; publication; or creation of derivative works without the explicit permission of the legal owner of the dataset.
Intended Use
(a) Speech-related tasks:
- Automatic speech recognition (ASR): Audio–text alignment allows the evaluation of speech recognition models for Ngemba. Note that the sentences are transcribed in the International Phonetic Alphabet (IPA). There is currently no widely adopted standardised orthography for Ngemba; the General Alphabet of Cameroon's Languages (GACEL) provides a reference framework but has not been systematically applied to this language.
- Text-to-speech (TTS): As the dataset contains clean sentence–audio pairs, it can also be used to evaluate speech synthesis models. The use of IPA transcription rather than a conventional orthography should be taken into account when designing TTS experiments.
- Speech–text alignment/forced alignment benchmarking: Fine-grained audio–text pairing provides ground truth for evaluating phoneme- or word-level aligners adapted to tonal and phonologically complex African languages.
(b) Translation and multilingual tasks:
- Machine translation (Ngemba ↔ French): The sentence-level alignment between Ngemba and French makes the dataset a parallel corpus for evaluating translation models, with the caveat that the Ngemba side is written in IPA transcription rather than in a conventional orthography.
- Speech translation (speech-to-text)
(c) Linguistic and lexicographic tasks:
- Morphological analysis/glossed corpus studies: The morpheme-level glosses are valuable for computational morphology, interlinear text modelling (ILTs) and grammar induction tasks, particularly for Grassfields Bantu languages.
- Lexicon and part-of-speech tagging: Useful for building linguistic resources such as dictionaries, morphological analysers or POS taggers for Ngemba and related Grassfields Bantu languages.
The dataset covers 374 lexical entries spanning nouns (218), verbs (76), adjectives (28), numerals (13), pronouns (10), prepositions (7) and adverbs (5).
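For the ASR evaluation use case listed above, one common metric is the character error rate (CER). The following is a minimal sketch, not an evaluation protocol prescribed by the dataset: it assumes that reference and hypothesis are IPA strings, normalises both to Unicode NFD so that precomposed and decomposed diacritics compare equal, and counts each base letter and each combining mark (tone or nasality) as one symbol. The function names `edit_distance` and `cer` are illustrative.

```python
import unicodedata


def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (insert, delete, substitute)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(reference, hypothesis):
    """Character error rate over NFD code points (assumption: marks count as symbols)."""
    ref = unicodedata.normalize("NFD", reference)
    hyp = unicodedata.normalize("NFD", hypothesis)
    if not ref:
        raise ValueError("reference must not be empty")
    return edit_distance(ref, hyp) / len(ref)
```

Under this choice, a hypothesis that gets the segments right but drops one tone mark is penalised by exactly one symbol. Whether tone errors should be weighted this way is a design decision for each experiment.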
Ngemba, also designated Ghomala-Ouest in the Administrative Atlas of Cameroon (Breton and Bikia Fohtung 1991), is a Grassfields Bantu language belonging to the wider Mbam-Nkam branch of the Bantoid family. It is spoken primarily in the West Region of Cameroon. According to Breton and Bikia Fohtung (1991), Ngemba comprises three main speech varieties: Mugum (Bamugum), Meka (Bameka), and Monjo' (Bamenju). The language is rarely represented in standard grammatical descriptions or computational resources.
As documented in the Administrative Atlas of Cameroon (Breton and Bikia Fohtung 1991), the Ngemba language encompasses three identified speech varieties: Mugum (spoken in Bamugum), Meka (spoken in Bameka), and Monjo' (spoken in Bamenju). These varieties display phonological features — including nasal vowels, ejective consonants and a complex tonal system — that are characteristic of the Grassfields Bantu area. At the time of publication of this dataset, a full systematic comparative description of variation across these three varieties is not yet available.
The writing system used for the transcription of Ngemba in this dataset is the International Phonetic Alphabet (IPA), as reflected in lexical entries (Word) and sentence-level examples (LangEx) in the datasheet. The phonological inventory described below is derived directly from the attested forms in the LangEx and Word columns of the datasheet.
The vowel system attested in the dataset includes both oral and nasal vowels:
Oral vowels: i, e, ɛ, a, ɔ, o, u, ə, ɨ, ʉ
Nasal vowels: ĩ, ẽ, ã, õ, ũ, ə̃
These vowels occur with and without tone marking in lexical items and running text (e.g. èntšù 'mouth', lezɨk 'eye', mèntšù 'mouths', lepʉ 'feather').
The consonant inventory reflected in the dataset includes the following simple, prenasalized, affricate and ejective consonants:
b, d, dz, f, g, h, k, kw, l, m, mb, mv, n, nd, ng, ŋ, ŋk, p, r, s, t, th, ts, tš, v, w, y, z, ɲ, ʔ
These consonants appear consistently across noun stems, verbal forms, derivational patterns and noun-class alternations (e.g. èntšù 'mouth', lezɨk 'eye', ntsɔʔ 'ear', lesõ 'tooth', athu 'person').
The datasheet shows lexical and grammatical contrastive tones, marked directly on vowels. The following tonal categories are attested in the LangEx and Word columns:
High tone (H): á, é, ɛ́, í, ó, ɔ́, ú, ə́, ʉ́
Low tone (L): à, è, ɛ̀, ì, ò, ɔ̀, ù, ə̀, ʉ̀
Falling contour tone (HL): â, ê, î, ô, ɔ̂, û, ə̂, ẽ̂
Rising contour tone (LH): ǎ, ě, ǐ, ǒ, ɔ̌, ǔ, ə̌
Mid tone (M): ā, ē, ō, ɔ̄
Unmarked vowels represent tonally neutral or contextually determined syllables. Nasal vowels also carry tone distinctions (e.g. ẽ̀, õ̀, ũ).
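The tone categories listed above correspond to standard Unicode combining diacritics, so they can be read off programmatically. The sketch below is one possible approach under stated assumptions: it decomposes the text to NFD, inspects only the ten vowel letters of the inventory above, and ignores tone marks on other segments (for example a syllabic nasal such as ǹ). The consonant š also decomposes to s plus a caron, which is why marks on non-vowels must not be counted as rising tones. The name `vowel_tones` is illustrative.

```python
import unicodedata

# Combining diacritics used in the tone lists above.
TONE_MARKS = {
    "\u0301": "H",   # acute
    "\u0300": "L",   # grave
    "\u0302": "HL",  # circumflex
    "\u030C": "LH",  # caron
    "\u0304": "M",   # macron
}
NASAL_MARK = "\u0303"  # combining tilde
VOWELS = set("ieɛaɔouəɨʉ")


def vowel_tones(text):
    """Return (vowel, tone or None, is_nasal) for every vowel in text."""
    chars = unicodedata.normalize("NFD", text)
    result = []
    i = 0
    while i < len(chars):
        base = chars[i]
        i += 1
        marks = []
        # Collect the combining marks attached to this base character.
        while i < len(chars) and unicodedata.combining(chars[i]):
            marks.append(chars[i])
            i += 1
        if base.lower() not in VOWELS:
            continue  # skip consonants, including s + caron (š)
        tone = next((TONE_MARKS[m] for m in marks if m in TONE_MARKS), None)
        result.append((base, tone, NASAL_MARK in marks))
    return result
```

For example, `vowel_tones("èntšù")` returns `[("e", "L", False), ("u", "L", False)]`, and an unmarked nasal vowel such as the one in `lesõ` is reported with tone `None` and `is_nasal` set to `True`.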
The dataset was collected through a questionnaire designed to gather basic information about the Ngemba lexicon and grammar. This was done as part of the Atlas Linguistique du Cameroun (ALCAM) project.
The dataset is based on a linguistic questionnaire designed to elicit basic lexical and grammatical information.
Total size is approximately 9.9 MB (uncompressed). Total audio duration: 795.7 seconds (00:13:15).
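The duration figure can be cross-checked with a little arithmetic. The sketch below converts 795.7 seconds to HH:MM:SS and, taking the 257 clips (253 MP3 + 4 WAV) reported for this dataset, derives a mean clip length. `format_hms` is a hypothetical helper, and it truncates fractional seconds, which matches the 00:13:15 quoted above.

```python
def format_hms(total_seconds):
    """Format a duration in seconds as HH:MM:SS, truncating fractions."""
    minutes, seconds = divmod(int(total_seconds), 60)
    hours, minutes = divmod(minutes, 60)
    return f"{hours:02d}:{minutes:02d}:{seconds:02d}"


TOTAL_SECONDS = 795.7   # total audio duration stated in this card
CLIP_COUNT = 253 + 4    # MP3 + WAV clips reported for the dataset

duration_label = format_hms(TOTAL_SECONDS)
mean_clip_seconds = TOTAL_SECONDS / CLIP_COUNT
print(duration_label, round(mean_clip_seconds, 2))
```

This gives a mean of about 3.1 seconds per clip, which is consistent with recordings of single words and short sentences.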
The dataset comprises: 1) a datasheet (ALCAM_dataset_Ngemba.tsv) with 374 lines and 20 columns; 2) 257 voice clips (253 MP3 + 4 WAV) read by a single native speaker, with a total duration of 795.7 seconds (00:13:15); 3) a sentence-to-audio mapping file (mapping.tsv) with 257 lines and 4 columns.
#OrigID: original number of lexical entry on paper questionnaire
#EditID: modification of #OrigID
#FrenchRef: reference entry (originally provided in French)
#FrenchComm: original comments about reference entry (#FrenchRef)
#French: lexical entry in French (overlaps with #FrenchRef)
#Note: note of researcher on the lexical entry
#POS: part of speech
#Class: noun class (where applicable)
#Morf: morphological attribute (e.g. plural, singular)
#Var: (na)
#Word: lexical entry in Ngemba
#CrossRef: cross-referencing of lexical entry number
#FrenchEx: example sentence in French
#LangEx: example sentence in Ngemba
#LangExEdit: manual editing of #LangEx
#FrenchExEdit: edited French equivalent of #FrenchEx
#LangPars: word-for-word parsing in Ngemba
#LangParsEdit: editing of #LangPars
#FrenchPars: French equivalent of #LangParsEdit
#FrenchParsEdit: editing of #FrenchPars
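The column list above can be used to load the datasheet and tally entries per part of speech. The following is a minimal sketch under several assumptions that are not confirmed by this card: the file has a header row, the header names match the list above with or without the leading `#`, and fields are plain tab-separated text without quoting. The helper names `parse_tsv` and `pos_counts` are illustrative. Such a tally is a useful check, since the seven categories quoted under Intended Use sum to 357 of the 374 entries.

```python
import csv
from collections import Counter


def parse_tsv(lines):
    """Parse tab-separated lines with a header row into a list of dicts.

    A leading '#' on header names is stripped (assumed header convention).
    """
    reader = csv.DictReader(lines, delimiter="\t", quoting=csv.QUOTE_NONE)
    rows = []
    for raw in reader:
        rows.append({
            key.lstrip("#").strip(): (value or "").strip()
            for key, value in raw.items()
            if key is not None  # ignore surplus fields beyond the header
        })
    return rows


def pos_counts(rows, column="POS"):
    """Count non-empty values of the part-of-speech column."""
    return Counter(row[column] for row in rows if row.get(column))
```

Typical usage would be `with open("ALCAM_dataset_Ngemba.tsv", encoding="utf-8", newline="") as f: rows = parse_tsv(f)`, followed by `pos_counts(rows)`. The label values stored in the POS column are not documented here, so they should be inspected before being mapped to categories.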
| audio files | words & sentences |
|---|---|
| 0547a6c9c597d0345d6dae1ceed17bb8.mp3 | èntšù; è ntšù mèndzʉ́ wɔ́ kɔ̀gí |
| c16be249d460cdfadd47ca50345ea0c4.mp3 | mèntšù |
| 7c6e1c4716a0c159ffde9bd21afba0f5.mp3 | lezɨk; pó sì ntsì nĩ́ nìkə̀ mɔ̄p |
| 05076b6f5f19231fa68e4a4a7ccf3703.mp3 | ntsɔʔ; ǹtsɔ̀ʔ zì síŋ |
| 064fabc648bf6cf6a93e69d65be5703c.mp3 | lesõ; mèsõ̀ mí té põ̀ |
| 073ff73ce0412fcc9159c3503c453ce9.mp3 | athu; á sí ndúwə̀ lwè zī |
| 0888a5efb48f682de65b8a7da7795035.mp3 | ale; á sí ŋkótɛ́ lêsò tsī |
| 09775f1a529930b7779ca454dfec867d.mp3 | letũ; á sí ŋkótɛ́ ndzìm lètṹŋ zì |
| 09a5bd998693b803177aa33c56562fec.mp3 | ntõ; ẽ̀ ntõ̀ yí fã̀ã̀ |
| 0b328f2ec0a3b24faa11d6d580ffd76b.mp3 | lepʉ; mèndzʉ̀ wɔ̀ɔ́ sí sɔ̀gɔ̀ mbē |
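In the preview rows above, each clip is paired with a word and, where present, an example sentence separated from it by a semicolon. The sketch below parses that preview format only. The four-column layout of mapping.tsv itself is not documented here, so nothing is assumed about it, and `split_entry` and `parse_preview` are hypothetical helper names.

```python
def split_entry(cell):
    """Split 'word; sentence' into (word, sentence); sentence is None if absent."""
    word, sep, sentence = cell.partition(";")
    sentence = sentence.strip() if sep else ""
    return word.strip(), (sentence or None)


def parse_preview(lines):
    """Map audio file names to (word, sentence) pairs from pipe-delimited rows."""
    mapping = {}
    for line in lines:
        cells = [c.strip() for c in line.strip().strip("|").split("|")]
        # Keep only two-cell rows whose first cell names an audio file;
        # this skips the header row and the |---|---| separator.
        if len(cells) != 2 or not cells[0].endswith((".mp3", ".wav")):
            continue
        mapping[cells[0]] = split_entry(cells[1])
    return mapping
```

Entries without a semicolon, such as the plural form `mèntšù`, yield a sentence of `None`, which makes it easy to separate word-only clips from word-plus-sentence clips.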