Kekem-ALCAM-MultimodalDataset

Description

Kekem-ALCAM-MultimodalDataset is a curated multimodal linguistic dataset dedicated to the documentation and technological enhancement of the Kekem variety of Mbo (ISO 639-3: mbo), a Bantu language spoken in and around the Kekem subdivision of the Haut-Nkam Department in Cameroon's West Region. Like many other Bantu languages of Cameroon situated at the crossroads of linguistic regions, Mbo, and the Kekem variety in particular, remains virtually absent from reference grammars, dictionaries, educational materials, computational resources, and language technology tools. The dataset's primary added value lies in its explicit focus on this variety.
The dataset comprises three closely aligned components: (i) a structured datasheet containing carefully selected example sentences and lexical entries reflecting attested usage in the Kekem variety of Mbo; (ii) high-quality audio recordings of these entries and sentences, produced by a native speaker across four recording sessions; and (iii) per-session audio–sentence mapping files enabling precise alignment between the textual and acoustic data.
Although Kekem is administratively located in the West Region, the indigenous Mbo people of this area do not identify as Bamiléké; rather, they share closer linguistic and historical ties with the Sawa (coastal) peoples. The Kekem variety displays a range of phonological and morphosyntactic features characteristic of Mbo, including a complex system of vowel contrasts, tonal distinctions, prenasalised consonants, and glottal closure markers. These features are essential for understanding the language's structural specificity and are rarely documented in machine-readable form. In this sense, the dataset contributes to a more inclusive and granular representation of African linguistic diversity.
From a methodological perspective, the dataset is designed to bridge the gap between language documentation and language technology. The parallel availability of text in Kekem (in IPA transcription) and in French, alongside aligned speech, makes the dataset suitable for a wide range of applications, including automatic speech recognition (ASR), text-to-speech (TTS), machine translation (MT), forced alignment, pronunciation modelling, and multimodal language learning tools. At the same time, the structured datasheet supports linguistic analysis, comparison with other Mbo dialects and related coastal Bantu languages, and pedagogical uses in teacher training and language revitalisation contexts.
The dataset was collected through the Atlas Linguistique du Cameroun (ALCAM) questionnaire framework, designed to gather basic lexical and grammatical information about Cameroonian national languages. The audio recordings were produced at the École Normale Supérieure de Yaoundé (ENS-Yaoundé) in June 2026, in the framework of the Mozilla Data Collective project. More broadly, the Kekem-ALCAM-MultimodalDataset exemplifies an approach to African language resources that highlights fluidity, orality, phonological richness, and community-based linguistic practice.

Language

Mbo (ISO 639-3: mbo) is a Bantu language belonging to the Southern Bantoid branch of the Niger-Congo family. It is spoken primarily in the Moungo Division of the Littoral Region of Cameroon, with the Kekem variety (sometimes referred to as "Mbo: Kekem") spoken in and around the Kekem subdivision of the Haut-Nkam Department, West Region. Despite its sociolinguistic significance within Cameroon, Mbo and its varieties remain substantially underrepresented in language technology resources. Although the Kekem subdivision is administratively part of the West Region, the indigenous Mbo speakers of this area do not identify as Bamiléké; they share closer historical and linguistic ties with the Sawa (coastal) peoples of the Littoral Region.

Variants

As documented in the Administrative Atlas of Cameroon's Languages (Breton & Bikia Fohtung 1991) and in field-linguistic records, the Mbo language comprises a set of geographically distributed speech varieties, including: Bareko, Melong, Santchou, and Kekem (the variety represented in this dataset), as well as others documented in the Moungo area (Bonkeŋ, Central-Mbo, Ehɔw, Mba, Ehɔ Mbo, Alɛ mbuu, Bakem). The present dataset covers the Kekem variety exclusively; resources for other varieties are not included in this release.

Writing System

The writing system used for the transcription of Kekem in this dataset is the International Phonetic Alphabet (IPA), as reflected in lexical entries (Word) and sentence-level examples (LangEx) in the datasheet. The phonological inventory described below is derived directly from the attested forms in the LangEx and Word columns of the Kekem datasheet.

1. Vowels

The vowel system attested in the dataset consists of oral vowels, both peripheral (i, e, ɛ, a, ɔ, o, u) and central (ə, ʉ, ɨ):

Oral vowels: i, e, ɛ, a, ɔ, o, u, ə, ʉ, ɨ

Where:

ɛ (epsilon): open-mid front unrounded vowel
ɔ (open-o): open-mid back rounded vowel
ə (schwa): mid-central vowel
ʉ (barred u): high central rounded vowel
ɨ (barred i): high central unrounded vowel

Long vowels are represented by vowel doubling (e.g., əə, ɔɔ).

2. Consonants

The consonant inventory reflected in the dataset includes simple, prenasalised, affricate, and fricative consonants:

b, d, f, g, ɣ, h, k, kp, l, m, mb, mp, mv, n, nd, ng, nʃ, ŋ, p, r, s, sy, ʃ, t, v, w, y, z, ɲ

Special symbols:

ɣ: voiced velar fricative
ʃ: voiceless postalveolar fricative
ʼ (modifier letter apostrophe): glottal stop / glottal closure marker
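
For tokenisation tasks (for example, building a unit inventory for ASR or TTS), multi-character units such as kp, mb or nʃ need to stay together. The following is a minimal sketch of a greedy longest-match segmenter. It assumes that the consonant list above is the complete set of multi-character units; the function name `segment` is hypothetical and not part of the dataset. Any symbol outside the list (vowels, the glottal marker, or consonants that occur in the data but are not listed above) is passed through as a single unit.

```python
import unicodedata

# Consonant units as listed in this section.
CONSONANTS = [
    "b", "d", "f", "g", "ɣ", "h", "k", "kp", "l", "m", "mb", "mp", "mv",
    "n", "nd", "ng", "nʃ", "ŋ", "p", "r", "s", "sy", "ʃ", "t", "v", "w",
    "y", "z", "ɲ",
]
# Try longer units first so that "mb" wins over "m" followed by "b".
UNITS = sorted(CONSONANTS, key=len, reverse=True)


def segment(word):
    """Split a word into units; combining marks stay with the preceding unit."""
    text = unicodedata.normalize("NFD", word)  # separate base letters and diacritics
    units = []
    i = 0
    while i < len(text):
        unit = next((u for u in UNITS if text.startswith(u, i)), None)
        if unit is None:
            unit = text[i]  # unlisted symbol: keep it as a single unit
        i += len(unit)
        # Attach tone marks and other combining diacritics to this unit.
        while i < len(text) and unicodedata.combining(text[i]):
            unit += text[i]
            i += 1
        units.append(unit)
    return units


print(segment("kpa"))
print(segment("mbɔ̀ɣɔ̀"))
```

The output is in decomposed (NFD) form, so a toned vowel is returned as a base letter followed by its combining mark. A nasal that carries its own tone mark is split from a following consonant, because the diacritic interrupts the multigraph.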

3. Tone system

The datasheet marks lexically and grammatically contrastive tones directly on vowels using IPA diacritics. The following tonal categories are attested in the LangEx and Word columns:

High tone (H): á, é, ɛ́, í, ó, ɔ́, ú, ə́, ʉ́
Mid tone (M): ā, ē, ɛ̄, ī, ō, ɔ̄, ū, ə̄, ʉ̄
Low tone (L): à, è, ɛ̀, ì, ò, ɔ̀, ù, ə̀, ʉ̀
Falling contour tone (HL): â, ê, î, ô, ɔ̂, û, ə̂
Rising contour tone (LH): ǎ, ě, ǐ, ǒ, ɔ̌, ǔ, ə̌

Unmarked vowels represent tonally neutral or contextually determined syllables.
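
Because tone is encoded with combining diacritics, the tonal tier can be separated from the segmental tier with Unicode normalisation. The sketch below assumes the five diacritics listed above (acute, macron, grave, circumflex, caron) are the only tone marks in the data; the names `TONE_MARKS`, `tone_sequence` and `strip_tones` are hypothetical. Unmarked vowels produce no label, and other diacritics are left untouched.

```python
import unicodedata

# Combining diacritics mapped to the tone categories listed in this section.
TONE_MARKS = {
    "\u0301": "H",   # combining acute accent
    "\u0304": "M",   # combining macron
    "\u0300": "L",   # combining grave accent
    "\u0302": "HL",  # combining circumflex (falling)
    "\u030C": "LH",  # combining caron (rising)
}


def tone_sequence(text):
    """Return the tone labels of a string in order of appearance."""
    decomposed = unicodedata.normalize("NFD", text)
    return [TONE_MARKS[ch] for ch in decomposed if ch in TONE_MARKS]


def strip_tones(text):
    """Remove tone diacritics while keeping all other characters."""
    decomposed = unicodedata.normalize("NFD", text)
    kept = "".join(ch for ch in decomposed if ch not in TONE_MARKS)
    return unicodedata.normalize("NFC", kept)


example = "pə̌h mə̀"
print(tone_sequence(example))
print(strip_tones(example))
```

Normalising to NFD first means the functions behave the same whether a toned vowel is stored as one precomposed character or as a base letter plus a combining mark.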

Source

The dataset was collected with a questionnaire designed to gather basic information about the Kekem lexicon and grammar, as part of the Atlas Linguistique du Cameroun (ALCAM) project. Audio recordings were produced by a native speaker of Kekem.

Domain

The dataset consists of responses to a linguistic questionnaire designed to elicit the basic lexicon and grammatical information of the Kekem variety of Mbo.

Size

Total audio duration: 1,127 seconds (18m 47s), distributed across 277 MP3 audio clips in 4 recording sessions.

Structure

The dataset is organised into 4 recording sessions, each corresponding to a folder containing MP3 audio clips and a per-session sentence-to-audio mapping file:

Session kekem_alcam_dataset_01: 71 clips (04m 50s)
Session kekem_alcam_dataset_02: 49 clips (03m 58s)
Session kekem_alcam_dataset_03: 88 clips (05m 44s)
Session kekem_alcam_dataset_04: 69 clips (04m 13s)
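
The per-session figures can be checked against the totals given under Size with a few lines of arithmetic. The sketch below uses only the numbers listed above; the helper name `to_seconds` is hypothetical. The clip counts sum to 277, matching the stated total. The listed durations sum to 1,125 seconds, 2 seconds below the stated 1,127 seconds. That gap would be consistent with each session duration having been rounded to whole seconds, although the source does not say so.

```python
# Per-session figures as listed above: (number of clips, duration label).
SESSIONS = {
    "kekem_alcam_dataset_01": (71, "04m 50s"),
    "kekem_alcam_dataset_02": (49, "03m 58s"),
    "kekem_alcam_dataset_03": (88, "05m 44s"),
    "kekem_alcam_dataset_04": (69, "04m 13s"),
}
STATED_TOTAL_SECONDS = 1127  # from the Size section


def to_seconds(label):
    """Convert a label such as '04m 50s' to a number of seconds."""
    minutes, seconds = label.split()
    return int(minutes.rstrip("m")) * 60 + int(seconds.rstrip("s"))


total_clips = sum(clips for clips, _ in SESSIONS.values())
total_seconds = sum(to_seconds(label) for _, label in SESSIONS.values())
mean_clip_seconds = round(STATED_TOTAL_SECONDS / total_clips, 2)

print(total_clips)        # 277
print(total_seconds)      # 1125
print(mean_clip_seconds)  # 4.07
```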

Each session folder contains:

MP3 audio clips
One per-session sentence-to-audio mapping file (mapping.tsv)

Additionally, the root of the dataset includes a structured datasheet (Kekem-ALCAM-Datasheet.tsv) with lexical entries and example sentences in IPA.

Description of columns (Kekem-ALCAM-Datasheet.tsv)

#OrigID: original number of lexical entry on paper questionnaire
#EditID: modification of #OrigID
#FrenchRef: reference entry (originally provided in French)
#FrenchComm: original comments about reference entry (#FrenchRef)
#French: lexical entry in French (overlaps with #FrenchRef)
#Note: note of researcher on the lexical entry
#POS: part of speech
#Class: noun class (where applicable)
#Morf: morphological attribute (e.g. plural, singular)
#Var: dialect variant tag
#Word: lexical entry in Kekem (IPA)
#CrossRef: cross-referencing of lexical entry number
#FrenchEx: example sentence in French
#LangEx: example sentence in Kekem (IPA)
#LangExEdit: manual editing of #LangEx
#FrenchExEdit: edited French equivalent of #FrenchEx
#LangPars: word-for-word parsing in Kekem
#LangParsEdit: editing of #LangPars
#FrenchPars: French equivalent of #LangParsEdit
#FrenchParsEdit: editing of #FrenchPars
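
Several columns come in pairs, an original and an edited version (for example #LangEx and #LangExEdit). The sketch below shows one way to read the datasheet and pick the edited value when it is present. It rests on two assumptions that the source does not confirm: that the file has a header row whose names match the list above (with or without the leading #), and that a non-empty edited column should take precedence over its original. The function names `read_tsv` and `preferred` are hypothetical, and the demo rows are placeholders rather than real data.

```python
import csv
import io


def read_tsv(handle):
    """Read a TSV stream into dicts, dropping a leading '#' from header names."""
    # QUOTE_NONE keeps apostrophes and quotation marks in the IPA text literal.
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    header = [name.lstrip("#").strip() for name in next(reader)]
    return [dict(zip(header, row)) for row in reader]


def preferred(row, base):
    """Return the '<base>Edit' value when non-empty, otherwise the '<base>' value."""
    edited = row.get(base + "Edit", "").strip()
    return edited or row.get(base, "").strip()


# Placeholder rows (not real data) illustrating the fallback behaviour.
demo = io.StringIO("#Word\t#LangEx\t#LangExEdit\nw1\ts1\t\nw2\ts2\ts2-edited\n")
rows = read_tsv(demo)
print([preferred(row, "LangEx") for row in rows])  # ['s1', 's2-edited']
```

With the real file, the same function can be called on a handle opened with `open("Kekem-ALCAM-Datasheet.tsv", encoding="utf-8", newline="")`.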

Description of columns (mapping.tsv)

#audio_filename: filename of the audio clip (MP3)
#key: unique hash identifier of the recording
#sentence: sentence text as read by the speaker, transcribed in IPA
#attempts: number of recording attempts before acceptance
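
A session folder can be turned into (audio path, sentence) pairs for ASR or TTS pipelines by joining each row of mapping.tsv with its folder. The sketch below assumes that mapping.tsv has a header row with the column names listed above (with or without the leading #) and that the attempts column holds integers. The function names `load_mapping` and `missing_audio` are hypothetical.

```python
import csv
from pathlib import Path


def load_mapping(session_dir):
    """Yield (audio_path, sentence, attempts) for each row of a session's mapping.tsv."""
    session_dir = Path(session_dir)
    with open(session_dir / "mapping.tsv", encoding="utf-8", newline="") as handle:
        reader = csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
        for raw in reader:
            # Normalise header names; skip the None key used for surplus fields.
            row = {key.lstrip("#"): value for key, value in raw.items() if key is not None}
            audio_path = session_dir / row["audio_filename"]
            yield audio_path, row["sentence"], int(row["attempts"])


def missing_audio(session_dir):
    """Return the audio paths named in mapping.tsv that do not exist on disk."""
    return [path for path, _, _ in load_mapping(session_dir) if not path.exists()]
```

For example, `list(load_mapping("kekem_alcam_dataset_01"))` would return the pairs for the first session, and `missing_audio("kekem_alcam_dataset_01")` should return an empty list for a complete download.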

Sample

audio file	sentence (Kekem, IPA)
c692b09d295784f5c901ffbdf3461581.mp3	nʃwì ; yɔ́ ɣʉ mɛ̄nyǎ' nʃẅì
8204f91955cfe4475e22d59ca6eaa720.mp3	nʃwì
2ac1c1d9d0c0dbaa57fef45099699376.mp3	zə̄h ; Pʉ̄ ná' mē sʉ'ʉ nə̂h mɔ̄p
0a3dd8a87e7bddd8c65977f34c5ae96c.mp3	twī ; yɔ́ ɣʉ mā'à twī pí syà ntɔ̄k myà'
81c2b0dd7eaa7b99e6cb2f9566e59d7b.mp3	nzɔ̄ ; nzɔ́ lā:h
0a24ea2f8c0976cd380fadda11cbe405.mp3	nɛ̀p ; pə̌h mə̀ʒyɛ̀ nɛ̀p
d1dbe039ddc5250403b1979e2aa89f6b.mp3	fə̀fɛ̀ ; fə̀fɛ̀ mə̀ncə́'
df8812bcbefb8454ca11f1dac553d876.mp3	mə̀' ; mə̌' ní nìnyà
7dd833d14dc0cacb81554bdc1384b96c.mp3	ŋgə̀βə̀ ; à mwák èkə̀kɔ̌p yɛ̀n
506170f7cf7dce47f326d87b6acf178a.mp3	fù ; ǹ tə́m n yèé éyí mə̀wɔ̀m
f2292c726dbb5d2b9a11ebbc8ac02ebc.mp3	yèē tə́ mbɔ̀ɣɔ̀ sɔ̀ɔ̄
186bdee33fa1db639e48ef717d967001.mp3	à fœ̀p yǒmdyɛ̀:
0630fdbc53a291ac1476d5b8a68dc214.mp3	ɣū' wū' ; míī pə̌ŋkə́ bá ŋkù ə̀fə́nə̀
3e079618b8f6b5c71e0962ab98e51647.mp3	zɨ ; nə̀ yār mə̀
5fd1001736d31d9a1af30be715a9180e.mp3	sɨ̄ sɨ̄ ; dìɛ́ tɛ́
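
Several sample rows contain two parts separated by a semicolon, while others contain a single part. The source does not document this convention; a plausible reading, treated here as an assumption, is that the semicolon separates an isolated lexical entry from an example sentence read in the same clip. A minimal sketch for splitting such rows, with the hypothetical function name `split_parts`:

```python
def split_parts(sentence):
    """Split a mapping sentence on ';' and trim surrounding whitespace."""
    return [part.strip() for part in sentence.split(";") if part.strip()]


print(split_parts("nɛ̀p ; pə̌h mə̀ʒyɛ̀ nɛ̀p"))
print(split_parts("nʃwì"))
```

Rows without a semicolon come back as a one-element list, so downstream code can treat both shapes uniformly.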