Ngemba-ALCAM-MultimodalDataset

Description

Ngemba_ALCAM-MultimodalDataset is a curated multimodal linguistic dataset for the documentation and technological support of the Ngemba language, also referred to in the literature as Ghomala-Ouest (Breton and Bikia Fohtung 1991). Ngemba is a Grassfields Bantu language spoken in the West Region of Cameroon and is rarely represented in standard grammatical descriptions, computational resources or lexicographical tools.

The dataset comprises three closely aligned components: (i) a structured datasheet containing carefully selected example sentences and lexical entries reflecting attested usage in Ngemba; (ii) high-quality audio recordings of these entries, produced by a native speaker; and (iii) an explicit audio-to-sentence mapping file that aligns the textual and acoustic data.

The dataset's primary added value lies in its focus on Ngemba, a language that, like many other Grassfields Bantu languages, remains virtually absent from reference grammars, dictionaries, educational materials and language technology resources. It captures a range of phonological and morphosyntactic features characteristic of Ngemba, including a complex system of vowel harmony, nasal vowels, ejective consonants and lexical tone. These features are essential for understanding the structure of the language and are rarely documented in machine-readable form. In this sense, the dataset contributes to a more inclusive and granular representation of African linguistic diversity.

From a methodological perspective, the dataset is designed to bridge the gap between language documentation and language technology. The parallel availability of text in Ngemba and in French, alongside aligned speech, makes the dataset suitable for a wide range of applications, including automatic speech recognition (ASR), text-to-speech (TTS), machine translation (MT), forced alignment, pronunciation modelling and multimodal language learning tools. The structured datasheet also supports linguistic analysis, contrastive studies with other Grassfields Bantu varieties, and pedagogical uses in teacher training and language revitalisation contexts. More broadly, the Ngemba_ALCAM-MultimodalDataset exemplifies an approach to African language resources that highlights fluidity, orality, phonological richness and community-based linguistic practice.

Language

Ngemba, also designated Ghomala-Ouest in the Administrative Atlas of Cameroon (Breton and Bikia Fohtung 1991), is a Grassfields Bantu language belonging to the wider Mbam-Nkam branch of the Bantoid family. It is spoken primarily in the West Region of Cameroon. According to Breton and Bikia Fohtung (1991), Ngemba comprises three main speech varieties: Mugum (Bamugum), Meka (Bameka), and Monjo' (Bamenju). The language is rarely represented in standard grammatical descriptions or computational resources.

Variants

As documented in the Administrative Atlas of Cameroon (Breton and Bikia Fohtung 1991), the Ngemba language encompasses three identified speech varieties: Mugum (spoken in Bamugum), Meka (spoken in Bameka), and Monjo' (spoken in Bamenju). These varieties display phonological features characteristic of the Grassfields Bantu area, including nasal vowels, ejective consonants and a complex tonal system. At the time of publication of this dataset, no full systematic comparative description of variation across the three varieties is available.

Writing System

The writing system used for the transcription of Ngemba in this dataset is the International Phonetic Alphabet (IPA), as reflected in lexical entries (Word) and sentence-level examples (LangEx) in the datasheet. The phonological inventory described below is derived directly from the attested forms in the LangEx and Word columns of the datasheet.

1. Vowels

The vowel system attested in the dataset includes both oral and nasal vowels:

Oral vowels: i, e, ɛ, a, ɔ, o, u, ə, ɨ, ʉ

Nasal vowels: ĩ, ẽ, ã, õ, ũ, ə̃

These vowels occur with and without tone marking in lexical items and running text (e.g. èntšù 'mouth', lezɨk 'eye', mèntšù 'mouths', lepʉ 'feather').

2. Consonants

The consonant inventory reflected in the dataset includes the following simple, prenasalised, affricate and ejective consonants:

b, d, dz, f, g, h, k, kw, l, m, mb, mv, n, nd, ng, ŋ, ŋk, p, r, s, t, th, ts, tš, v, w, y, z, ɲ, ʔ

These consonants appear consistently across noun stems, verbal forms, derivational patterns and noun-class alternations (e.g. èntšù 'mouth', lezɨk 'eye', ntsɔʔ 'ear', lesõ 'tooth', athu 'person').
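
The inventory above contains several multi-character units (e.g. tš, ts, dz, kw, mb, ŋk), so a transcribed form cannot be split into segments character by character. The following is a minimal sketch of one way to segment a form, assuming greedy longest-match against the listed consonants; the function name segment is hypothetical, and greedy matching is an assumption that may not reflect the phonological analysis intended by the dataset authors (for instance, a sequence such as n + ts is split into two units).

```python
import unicodedata

# Consonant units as listed in this section.
CONSONANTS = [
    "b", "d", "dz", "f", "g", "h", "k", "kw", "l", "m", "mb", "mv",
    "n", "nd", "ng", "ŋ", "ŋk", "p", "r", "s", "t", "th", "ts", "tš",
    "v", "w", "y", "z", "ɲ", "ʔ",
]

def _nfd(text):
    # Decompose so that diacritics become separate combining characters.
    return unicodedata.normalize("NFD", text)

# Longest units first, so that "tš" is tried before "ts" and "t".
_UNITS = sorted((_nfd(c) for c in CONSONANTS), key=len, reverse=True)

def segment(word):
    """Split a transcribed form into segments (hypothetical helper).

    Consonants are matched greedily against the inventory; any other
    base character (e.g. a vowel) becomes its own segment, and
    combining marks (tone, nasalisation) attach to the previous segment.
    """
    text = _nfd(word)
    segments = []
    i = 0
    while i < len(text):
        match = next((u for u in _UNITS if text.startswith(u, i)), None)
        if match is not None:
            segments.append(match)
            i += len(match)
        elif unicodedata.combining(text[i]) and segments:
            segments[-1] += text[i]
            i += 1
        else:
            segments.append(text[i])
            i += 1
    # Recompose for display.
    return [unicodedata.normalize("NFC", s) for s in segments]

print(segment("èntšù"))
print(segment("ntsɔʔ"))
```

Normalising to NFD first matters here because š decomposes into s plus a combining caron, which would otherwise be indistinguishable from a rising tone mark.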

3. Tone system

The datasheet shows contrastive lexical and grammatical tones, marked directly on vowels. The following tonal categories are attested in the LangEx and Word columns:

High tone (H): á, é, ɛ́, í, ó, ɔ́, ú, ə́, ʉ́
Low tone (L): à, è, ɛ̀, ì, ò, ɔ̀, ù, ə̀, ʉ̀
Falling contour tone (HL): â, ê, î, ô, ɔ̂, û, ə̂, ẽ̂
Rising contour tone (LH): ǎ, ě, ǐ, ǒ, ɔ̌, ǔ, ə̌
Mid tone (M): ā, ē, ō, ɔ̄

Unmarked vowels represent tonally neutral or contextually determined syllables. Nasal vowels also carry tone distinctions (e.g. ẽ̀, õ̀, ũ).
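
The tone categories above correspond to Unicode combining diacritics (acute, grave, circumflex, caron, macron), with the tilde marking nasalisation. The sketch below shows one way to read tone and nasality off a transcribed form. It assumes the text can be normalised to NFD and that only the vowels listed in this section carry the marks of interest; the function name vowel_tones is hypothetical. Tone-bearing syllabic nasals (such as ǹ in the sample) are skipped under this assumption.

```python
import unicodedata

# Combining diacritics mapped to the tone labels used in this section.
TONE_MARKS = {
    "\u0301": "H",   # acute: high
    "\u0300": "L",   # grave: low
    "\u0302": "HL",  # circumflex: falling
    "\u030C": "LH",  # caron: rising
    "\u0304": "M",   # macron: mid
}
NASAL_MARK = "\u0303"  # combining tilde
VOWELS = set("ieɛaɔouəɨʉ")

def vowel_tones(text):
    """Return (vowel, is_nasal, tone) for each vowel in text.

    tone is one of "H", "L", "HL", "LH", "M", or None when unmarked.
    """
    result = []
    current = None  # [vowel, is_nasal, tone] for the vowel being read
    for ch in unicodedata.normalize("NFD", text):
        if unicodedata.combining(ch):
            # Marks on consonants (e.g. the caron of š) are ignored.
            if current is not None:
                if ch == NASAL_MARK:
                    current[1] = True
                elif ch in TONE_MARKS:
                    current[2] = TONE_MARKS[ch]
            continue
        if current is not None:
            result.append(tuple(current))
        current = [ch, False, None] if ch in VOWELS else None
    if current is not None:
        result.append(tuple(current))
    return result

print(vowel_tones("èntšù"))  # [('e', False, 'L'), ('u', False, 'L')]
print(vowel_tones("lesõ"))   # [('e', False, None), ('o', True, None)]
```

Tracking the base character before each mark is what keeps the caron in tš from being counted as a rising tone.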

Source

The dataset was collected with a questionnaire designed to gather basic information about the Ngemba lexicon and grammar, as part of the Atlas Linguistique du Cameroun (ALCAM) project.

Domain

The dataset consists of responses to a linguistic questionnaire designed to elicit basic lexicon and grammatical information.

Size

Total size is approximately 9.9 MB (uncompressed). Total audio duration: 795.7 seconds (00:13:15).

Structure

The dataset comprises: 1) a datasheet (ALCAM_dataset_Ngemba.tsv) with 374 lines and 20 columns; 2) 257 voice clips (253 MP3 + 4 WAV) read by a single native speaker, with a total duration of 795.7 seconds (00:13:15); 3) a sentence-to-audio mapping file (mapping.tsv) with 257 lines and 4 columns.
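
The line and column counts stated above can be checked after download. The following is a minimal sketch, assuming both files are UTF-8 encoded and tab-delimited; the helper name tsv_shape is hypothetical, and whether the stated line counts include a header row is not specified here, so the function simply reports what it finds.

```python
import csv

def tsv_shape(path):
    """Return (line_count, set_of_column_counts) for a tab-separated file.

    A well-formed file yields a single column count, e.g. (374, {20}).
    """
    with open(path, encoding="utf-8", newline="") as f:
        # QUOTE_NONE: treat quote characters in the transcriptions as data.
        reader = csv.reader(f, delimiter="\t", quoting=csv.QUOTE_NONE)
        rows = list(reader)
    return len(rows), {len(row) for row in rows}

# Example usage (paths assume the files sit in the working directory):
# print(tsv_shape("ALCAM_dataset_Ngemba.tsv"))
# print(tsv_shape("mapping.tsv"))
```

If the returned set contains more than one column count, some rows are likely malformed (for example, a stray tab inside a field).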

Description of columns

#OrigID: original number of lexical entry on paper questionnaire
#EditID: modification of #OrigID
#FrenchRef: reference entry (originally provided in French)
#FrenchComm: original comments about reference entry (#FrenchRef)
#French: lexical entry in French (overlaps with #FrenchRef)
#Note: note of researcher on the lexical entry
#POS: part of speech
#Class: noun class (where applicable)
#Morf: morphological attribute (e.g. plural, singular)
#Var: (na)
#Word: lexical entry in Ngemba
#CrossRef: cross-referencing of lexical entry number
#FrenchEx: example sentence in French
#LangEx: example sentence in Ngemba
#LangExEdit: manual editing of #LangEx
#FrenchExEdit: edited French equivalent of #FrenchEx
#LangPars: word-for-word parsing in Ngemba
#LangParsEdit: editing of #LangPars
#FrenchPars: French equivalent of #LangParsEdit
#FrenchParsEdit: editing of #FrenchPars
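
Several columns come in original/edited pairs (#LangEx and #LangExEdit, #FrenchEx and #FrenchExEdit, #LangPars and #LangParsEdit, #FrenchPars and #FrenchParsEdit). The sketch below shows one possible way to pick a value from such a pair when a row has been read into a dictionary. It assumes that an edited column is left empty when no edit was made and that header names match the list above, with or without the leading #; both points are assumptions, and the helper name preferred_field is hypothetical.

```python
def preferred_field(row, base):
    """Return the edited value for a column pair if present, else the original.

    row  -- a dict mapping column names to cell values
    base -- the column name without '#' or 'Edit', e.g. "LangEx"
    """
    # Try the edited column first, then fall back to the original one.
    for suffix in ("Edit", ""):
        for prefix in ("#", ""):
            value = (row.get(f"{prefix}{base}{suffix}") or "").strip()
            if value:
                return value
    return ""

# Example with an invented row, for illustration only:
row = {"#LangEx": "original form", "#LangExEdit": ""}
print(preferred_field(row, "LangEx"))  # original form
```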

Sample

audio files	words & sentences
0547a6c9c597d0345d6dae1ceed17bb8.mp3	èntšù; è ntšù mèndzʉ́ wɔ́ kɔ̀gí
c16be249d460cdfadd47ca50345ea0c4.mp3	mèntšù
7c6e1c4716a0c159ffde9bd21afba0f5.mp3	lezɨk; pó sì ntsì nĩ́ nìkə̀ mɔ̄p
05076b6f5f19231fa68e4a4a7ccf3703.mp3	ntsɔʔ; ǹtsɔ̀ʔ zì síŋ
064fabc648bf6cf6a93e69d65be5703c.mp3	lesõ; mèsõ̀ mí té põ̀
073ff73ce0412fcc9159c3503c453ce9.mp3	athu; á sí ndúwə̀ lwè zī
0888a5efb48f682de65b8a7da7795035.mp3	ale; á sí ŋkótɛ́ lêsò tsī
09775f1a529930b7779ca454dfec867d.mp3	letũ; á sí ŋkótɛ́ ndzìm lètṹŋ zì
09a5bd998693b803177aa33c56562fec.mp3	ntõ; ẽ̀ ntõ̀ yí fã̀ã̀
0b328f2ec0a3b24faa11d6d580ffd76b.mp3	lepʉ; mèndzʉ̀ wɔ̀ɔ́ sí sɔ̀gɔ̀ mbē
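
In the sample above, each line pairs an audio file name with a lexical entry, optionally followed by a semicolon and an example sentence. The following is a minimal sketch for parsing lines in this two-column layout. It assumes a tab between the two columns and a semicolon between word and sentence, as displayed here; the layout of mapping.tsv itself (4 columns) may differ, and the function name parse_sample_line is hypothetical.

```python
def parse_sample_line(line):
    """Split a sample line into (audio_file, word, sentence).

    sentence is None when the entry has no example sentence.
    """
    audio, _, text = line.rstrip("\n").partition("\t")
    word, sep, sentence = text.partition(";")
    return audio.strip(), word.strip(), (sentence.strip() if sep else None)

print(parse_sample_line("073ff73ce0412fcc9159c3503c453ce9.mp3\tathu; á sí ndúwə̀ lwè zī"))
print(parse_sample_line("c16be249d460cdfadd47ca50345ea0c4.mp3\tmèntšù"))
```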