License: NOODL-1.0
Steward: Institute of African Digital Humanities
Dataset ID: cmqf69mlt066omk07awntfrdn
Task: NLP
Release Date: 6/15/2026
Format: MP3, TSV
Size: 12.88 MB
Kekem-ALCAM-MultimodalDataset is a richly curated, multimodal linguistic dataset dedicated to the documentation and technological enhancement of the Kekem variety of Mbo (ISO 639-3: mbo), a Bantu language spoken in and around the Kekem subdivision of the Haut-Nkam Department of Cameroon's West Region. Mbo, and the Kekem dialect in particular, remains virtually absent from existing grammatical descriptions, computational resources, and lexicographical tools.
The dataset comprises three closely aligned components: (i) a structured datasheet containing carefully selected example sentences and lexical entries reflecting attested usage in the Kekem variety of Mbo; (ii) high-quality audio recordings of these entries and sentences, produced by a native speaker across four recording sessions; and (iii) per-session audio–sentence mapping files enabling precise alignment between the textual and acoustic data.
The dataset's primary added value lies in its explicit focus on the Kekem variety of Mbo, a language that, like many other Bantu languages of Cameroon situated at the crossroads of linguistic regions, remains essentially absent from reference grammars, dictionaries, educational materials, and language technology resources. Although Kekem is administratively located in the West Region, the indigenous Mbo people of this area do not identify as Bamiléké; rather, they share closer linguistic and historical ties with the Sawa (coastal) peoples. The Kekem dialect displays a range of phonological and morphosyntactic features characteristic of Mbo, including a complex system of vowel contrasts, tonal distinctions, prenasalised consonants and glottal closure markers, all of which are essential for understanding the language's structural specificity and are rarely documented in machine-readable form. In this sense, the dataset contributes to a more inclusive and granular representation of African linguistic diversity.
From a methodological perspective, the dataset is designed to bridge the gap between language documentation and language technology. The parallel availability of text in Kekem (in IPA transcription) and in French, alongside aligned speech, makes the dataset suitable for a wide range of applications, including automatic speech recognition (ASR), text-to-speech (TTS), machine translation (MT), forced alignment, pronunciation modelling, and multimodal language learning tools. At the same time, the structured datasheet supports linguistic analysis, comparison with other Mbo dialects and related coastal Bantu languages, and pedagogical uses in teacher training and language revitalisation contexts. The dataset was collected through the Atlas Linguistique du Cameroun (ALCAM) questionnaire framework, designed to gather basic lexical and grammatical information about Cameroonian national languages. The audio recordings were produced at the École Normale Supérieure de Yaoundé (ENS-Yaoundé) in June 2026, in the framework of the Mozilla Data Collective project. More broadly, the Kekem-ALCAM-MultimodalDataset exemplifies an approach to African language resources that highlights fluidity, orality, phonological richness, and community-based linguistic practice.
Licensing
Nwulite Obodo Open Data Licence 1.0 (NOODL-1.0)
https://licensingafricandatasets.com/nwulite-obodo-license
Restrictions/Special Constraints
By downloading this dataset, you agree:
- To use it for research and scientific purposes only
- That you will not re-host or re-share this dataset
Forbidden Usage
You agree not to use the data for: determining the identity of the speaker(s) in the dataset; attempting to clone the voice or train models that imitate the speaker(s) in this dataset; Generative AI; reproduction; duplication; modification; augmentation; copying; distribution; transmission; display; sale; transfer; publication or creation of derivative works without the explicit permission of the legal owner of the dataset.
Intended Use
(a) Speech-related tasks:
- Automatic speech recognition (ASR): Audio–text alignment allows the training and evaluation of speech recognition models for Kekem (Mbo). It should be noted that the sentences are transcribed using the IPA alphabet. There is currently no standardised orthography widely adopted for Kekem; the General Alphabet of Cameroon's Languages (GACEL/AGLC) provides a reference framework but has not been systematically applied to this variety.
- Text-to-speech (TTS): As the dataset contains clean sentence–audio pairs, it can also be used to evaluate speech synthesis or text-to-speech models for Kekem (Mbo). The use of IPA transcription rather than a conventional orthography should be taken into account when designing TTS experiments.
- Speech–text alignment / forced alignment benchmarking: Fine-grained audio–text pairing provides ground truth for evaluating phoneme- or word-level aligners adapted to tonal and phonologically complex Bantu languages of the Cameroonian interior.
(b) Translation and multilingual tasks:
- Machine translation (Kekem ↔ French): The sentence-level alignment between Kekem and French makes the dataset a parallel corpus for evaluating translation models, with the caveat that the phonetic transcription standard differs from any conventional writing system.
- Speech translation (speech-to-text)
(c) Linguistic and lexicographic tasks:
- Morphological analysis / glossed corpus studies: The morpheme-level glosses are valuable for computational morphology, interlinear text modelling (ILTs) and grammar induction tasks, particularly for Bantu languages of the Haut-Nkam area and the broader Mbo dialect continuum.
- Lexicon and part-of-speech tagging: Useful for building linguistic resources such as dictionaries, morphological analysers, or POS taggers for Kekem and related varieties of Mbo.
- Dialect comparison: The Kekem datasheet, structured in parallel with resources produced for other Mbo varieties (and the broader ALCAM framework), will facilitate systematic phonological, lexical, and morphosyntactic comparison between Kekem and related varieties, supporting variationist and typological studies of Bantu languages in the Cameroon highlands and coastal transition zone.
- Language documentation: The dataset contributes to the digital documentation of the Kekem variety of Mbo, supporting efforts to extend the digital presence of this under-resourced Bantu language spoken at the crossroads of the West and Littoral regions of Cameroon.
Mbo (ISO 639-3: mbo) is a Bantu language belonging to the Southern Bantoid branch of the Niger-Congo family. It is spoken primarily in the Moungo Division of the Littoral Region of Cameroon, with the Kekem variety (sometimes referred to as "Mbo: Kekem") spoken in and around the Kekem subdivision of the Haut-Nkam Department, West Region. Despite its sociolinguistic significance within Cameroon, Mbo and its varieties remain substantially underrepresented in language technology resources. Although the Kekem subdivision is administratively part of the West Region, the indigenous Mbo speakers of this area do not identify as Bamiléké; they share closer historical and linguistic ties with the Sawa (coastal) peoples of the Littoral Region.
As documented in the Administrative Atlas of Cameroon's Languages (Breton & Bikia Fohtung 1991) and in field-linguistic records, the Mbo language comprises a set of geographically distributed speech varieties, including: Bareko, Melong, Santchou, and Kekem (the variety represented in this dataset), as well as others documented in the Moungo area (Bonkeŋ, Central-Mbo, Ehɔw, Mba, Ehɔ Mbo, Alɛ mbuu, Bakem). The present dataset covers the Kekem variety exclusively; resources for other varieties are not included in this release.
The writing system used for the transcription of Kekem in this dataset is the International Phonetic Alphabet (IPA), as reflected in lexical entries (Word) and sentence-level examples (LangEx) in the datasheet. The phonological inventory described below is derived directly from the attested forms in the LangEx and Word columns of the Kekem datasheet.
The vowel system attested in the dataset consists of the following oral vowels, including the central vowels ə, ʉ, and ɨ:
Oral vowels: i, e, ɛ, a, ɔ, o, u, ə, ʉ, ɨ
Where:
ɛ (epsilon): open-mid front unrounded vowel
ɔ (open-o): open-mid back rounded vowel
ə (schwa): mid-central vowel
ʉ (barred u): high central rounded vowel
ɨ (barred i): high central unrounded vowel
Long vowels are represented by vowel doubling (e.g., əə, ɔɔ).
The consonant inventory reflected in the dataset includes simple, prenasalised, affricate, and fricative consonants:
b, d, f, g, ɣ, h, k, kp, l, m, mb, mp, mv, n, nd, ng, nʃ, ŋ, p, r, s, sy, ʃ, t, v, w, y, z, ɲ
Special symbols:
ɣ: voiced velar fricative
ʃ: voiceless postalveolar fricative
ʼ (modifier letter apostrophe): glottal stop / glottal closure marker
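As an illustration of how the inventory above could be used in preprocessing, the following is a minimal sketch of a greedy longest-match segmenter. It is not part of the dataset tooling. The function names are hypothetical, the symbol lists are copied from the inventory above, and two points are assumptions: that multi-character entries such as mb, kp, or nʃ should always be treated as single units, and that the ASCII apostrophe may stand in for the modifier letter apostrophe.

```python
import unicodedata

# Symbols copied from the vowel and consonant lists above.
CONSONANTS = ["b", "d", "f", "g", "ɣ", "h", "k", "kp", "l", "m", "mb", "mp",
              "mv", "n", "nd", "ng", "nʃ", "ŋ", "p", "r", "s", "sy", "ʃ", "t",
              "v", "w", "y", "z", "ɲ"]
VOWELS = ["i", "e", "ɛ", "a", "ɔ", "o", "u", "ə", "ʉ", "ɨ"]
# Assumption: the ASCII apostrophe is accepted alongside U+02BC.
GLOTTAL = ["\u02bc", "'"]

# Longest symbols first, so that "mb" is tried before "m".
SYMBOLS = sorted(CONSONANTS + VOWELS + GLOTTAL, key=len, reverse=True)


def strip_marks(text):
    """Remove combining diacritics (tone marks included) after NFD decomposition."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def segment(word):
    """Split a transcribed word into inventory symbols by greedy longest match."""
    base = strip_marks(word)
    segments = []
    i = 0
    while i < len(base):
        for sym in SYMBOLS:
            if base.startswith(sym, i):
                segments.append(sym)
                i += len(sym)
                break
        else:
            # Symbol not in the inventory: keep it unchanged.
            segments.append(base[i])
            i += 1
    return segments


print(segment("mbɔ̀ɣɔ̀"))  # ['mb', 'ɔ', 'ɣ', 'ɔ']
```

Greedy matching cannot tell a prenasalised consonant from a nasal followed by a stop across a morpheme boundary, so the output should be checked against the glosses. A doubled vowel comes out as two vowel segments.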
The datasheet shows contrastive lexical and grammatical tones, marked directly on vowels using IPA diacritics. The following tonal categories are attested in the LangEx and Word columns:
High tone (H): á, é, ɛ́, í, ó, ɔ́, ú, ə́, ʉ́
Mid tone (M): ā, ē, ɛ̄, ī, ō, ɔ̄, ū, ə̄, ʉ̄
Low tone (L): à, è, ɛ̀, ì, ò, ɔ̀, ù, ə̀, ʉ̀
Falling contour tone (HL): â, ê, î, ô, ɔ̂, û, ə̂
Rising contour tone (LH): ǎ, ě, ǐ, ǒ, ɔ̌, ǔ, ə̌
Unmarked vowels represent tonally neutral or contextually determined syllables.
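The tone categories above correspond to five Unicode combining marks (acute, macron, grave, circumflex, caron). The sketch below shows one way to read a tone sequence off a transcribed string by decomposing it to NFD first. The function name `tone_sequence` is hypothetical, and the sketch assumes that these five marks are used only for tone in the transcriptions.

```python
import unicodedata

# Combining marks for the five tone categories listed above.
TONE_MARKS = {
    "\u0301": "H",   # combining acute accent
    "\u0304": "M",   # combining macron
    "\u0300": "L",   # combining grave accent
    "\u0302": "HL",  # combining circumflex accent
    "\u030c": "LH",  # combining caron
}


def tone_sequence(text):
    """Return the tone labels found in text, in order of appearance."""
    # NFD splits precomposed letters such as "ì" into base letter + mark.
    decomposed = unicodedata.normalize("NFD", text)
    return [TONE_MARKS[ch] for ch in decomposed if ch in TONE_MARKS]


print(tone_sequence("yɔ́ ɣʉ mɛ̄nyǎ'"))  # ['H', 'M', 'LH']
```

Unmarked vowels produce no label, so the length of the result is the number of tone-marked vowels and not the number of syllables.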
The dataset was collected through a linguistic questionnaire designed to elicit the basic lexicon and grammatical information of the Kekem variety of Mbo, as part of the Atlas Linguistique du Cameroun (ALCAM) project. Audio recordings were produced by a native speaker of Kekem.
Total audio duration: 1,127 seconds (18m 47s), distributed across 277 MP3 audio clips in 4 recording sessions.
The dataset is organised into 4 recording sessions, each corresponding to a folder containing MP3 audio clips and a per-session sentence-to-audio mapping file:
Session kekem_alcam_dataset_01: 71 clips (04m 50s)
Session kekem_alcam_dataset_02: 49 clips (03m 58s)
Session kekem_alcam_dataset_03: 88 clips (05m 44s)
Session kekem_alcam_dataset_04: 69 clips (04m 13s)
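The per-session figures can be checked against the stated totals with a few lines of arithmetic. The sketch below uses only the numbers listed above; the variable and function names are hypothetical.

```python
# Clip counts and durations as listed per session above.
SESSIONS = {
    "kekem_alcam_dataset_01": (71, "04m 50s"),
    "kekem_alcam_dataset_02": (49, "03m 58s"),
    "kekem_alcam_dataset_03": (88, "05m 44s"),
    "kekem_alcam_dataset_04": (69, "04m 13s"),
}


def to_seconds(label):
    """Convert a label such as '04m 50s' to a number of seconds."""
    minutes, seconds = label.split()
    return int(minutes.rstrip("m")) * 60 + int(seconds.rstrip("s"))


total_clips = sum(clips for clips, _ in SESSIONS.values())
total_seconds = sum(to_seconds(duration) for _, duration in SESSIONS.values())

print(total_clips)    # 277
print(total_seconds)  # 1125
```

The clip counts add up to the stated 277. The per-session durations add up to 1,125 seconds, two seconds short of the stated total of 1,127 seconds. This is consistent with the per-session values being rounded to whole seconds, although the source does not confirm that explanation.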
Each session folder contains:
- MP3 audio clips
- One per-session sentence-to-audio mapping file (mapping.tsv)
Additionally, the root of the dataset includes a structured datasheet (Kekem-ALCAM-Datasheet.tsv) with lexical entries and example sentences in IPA.
Datasheet columns (Kekem-ALCAM-Datasheet.tsv):
#OrigID: original number of lexical entry on paper questionnaire
#EditID: modification of #OrigID
#FrenchRef: reference entry (originally provided in French)
#FrenchComm: original comments about reference entry (#FrenchRef)
#French: lexical entry in French (overlaps with #FrenchRef)
#Note: note of researcher on the lexical entry
#POS: part of speech
#Class: noun class (where applicable)
#Morf: morphological attribute (e.g. plural, singular)
#Var: dialect variant tag
#Word: lexical entry in Kekem (IPA)
#CrossRef: cross-referencing of lexical entry number
#FrenchEx: example sentence in French
#LangEx: example sentence in Kekem (IPA)
#LangExEdit: manual editing of #LangEx
#FrenchExEdit: edited French equivalent of #FrenchEx
#LangPars: word-for-word parsing in Kekem
#LangParsEdit: editing of #LangPars
#FrenchPars: French equivalent of #LangParsEdit
#FrenchParsEdit: editing of #FrenchPars
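For machine translation experiments, the example-sentence columns can be paired into a Kekem–French parallel list. The following is a minimal sketch, not the authors' procedure. It assumes that rows have already been read into dictionaries keyed by column name without the leading "#", and that a filled LangExEdit or FrenchExEdit value should take precedence over the unedited column. Neither assumption is confirmed by the dataset documentation, and the function names are hypothetical.

```python
def first_filled(row, *columns):
    """Return the first non-empty value among the given columns, else ''."""
    for name in columns:
        value = (row.get(name) or "").strip()
        if value:
            return value
    return ""


def parallel_pairs(rows):
    """Yield (kekem, french) sentence pairs from datasheet rows.

    Assumption: edited columns take precedence over the originals.
    Rows lacking either side are skipped.
    """
    for row in rows:
        kekem = first_filled(row, "LangExEdit", "LangEx")
        french = first_filled(row, "FrenchExEdit", "FrenchEx")
        if kekem and french:
            yield kekem, french
```

Calling `list(parallel_pairs(rows))` gives a list of tuples ready to be written out as a two-column file or passed to an evaluation script.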
Mapping file columns (mapping.tsv):
#audio_filename: filename of the audio clip (MP3)
#key: unique hash identifier of the recording
#sentence: sentence text as read by the speaker, transcribed in IPA
#attempts: number of recording attempts before acceptance
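A session's mapping file can be read with Python's standard `csv` module. The sketch below is one possible loader and not an official one. It assumes that the file is UTF-8 encoded and tab-separated with a header row, and that the header names match the fields listed above, with or without the leading "#". The function name `load_mapping` and the added `audio_path` key are hypothetical.

```python
import csv
from pathlib import Path


def load_mapping(session_dir):
    """Read mapping.tsv from a session folder and attach each clip's path."""
    session_dir = Path(session_dir)
    records = []
    with open(session_dir / "mapping.tsv", encoding="utf-8", newline="") as handle:
        # QUOTE_NONE keeps quote characters in the text from being
        # interpreted as field delimiters.
        reader = csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
        for row in reader:
            # Drop a leading "#" from header names, if present.
            clean = {name.lstrip("#"): value
                     for name, value in row.items() if name is not None}
            # Assumption: clips sit in the same folder as mapping.tsv.
            clean["audio_path"] = session_dir / clean["audio_filename"]
            records.append(clean)
    return records
```

All values are returned as strings, so a field such as `attempts` needs an explicit `int(...)` conversion before numeric use.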
| audio file | sentence (Kekem, IPA) |
|---|---|
| c692b09d295784f5c901ffbdf3461581.mp3 | nʃwì ; yɔ́ ɣʉ mɛ̄nyǎ' nʃẅì |
| 8204f91955cfe4475e22d59ca6eaa720.mp3 | nʃwì |
| 2ac1c1d9d0c0dbaa57fef45099699376.mp3 | zə̄h ; Pʉ̄ ná' mē sʉ'ʉ nə̂h mɔ̄p |
| 0a3dd8a87e7bddd8c65977f34c5ae96c.mp3 | twī ; yɔ́ ɣʉ mā'à twī pí syà ntɔ̄k myà' |
| 81c2b0dd7eaa7b99e6cb2f9566e59d7b.mp3 | nzɔ̄ ; nzɔ́ lā:h |
| 0a24ea2f8c0976cd380fadda11cbe405.mp3 | nɛ̀p ; pə̌h mə̀ʒyɛ̀ nɛ̀p |
| d1dbe039ddc5250403b1979e2aa89f6b.mp3 | fə̀fɛ̀ ; fə̀fɛ̀ mə̀ncə́' |
| df8812bcbefb8454ca11f1dac553d876.mp3 | mə̀' ; mə̌' ní nìnyà |
| 7dd833d14dc0cacb81554bdc1384b96c.mp3 | ŋgə̀βə̀ ; à mwák èkə̀kɔ̌p yɛ̀n |
| 506170f7cf7dce47f326d87b6acf178a.mp3 | fù ; ǹ tə́m n yèé éyí mə̀wɔ̀m |
| f2292c726dbb5d2b9a11ebbc8ac02ebc.mp3 | yèē tə́ mbɔ̀ɣɔ̀ sɔ̀ɔ̄ |
| 186bdee33fa1db639e48ef717d967001.mp3 | à fœ̀p yǒmdyɛ̀: |
| 0630fdbc53a291ac1476d5b8a68dc214.mp3 | ɣū' wū' ; míī pə̌ŋkə́ bá ŋkù ə̀fə́nə̀ |
| 3e079618b8f6b5c71e0962ab98e51647.mp3 | zɨ ; nə̀ yār mə̀ |
| 5fd1001736d31d9a1af30be715a9180e.mp3 | sɨ̄ sɨ̄ ; dìɛ́ tɛ́ |
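Many rows in the sample above appear to follow a "headword ; example sentence" pattern, while others contain a single item. This reading is an inference from the sample and is not documented in the dataset description. Under that assumption, a small helper (hypothetical name `split_entry`) can separate the two parts:

```python
def split_entry(sentence):
    """Split 'headword ; example' into its parts.

    Returns (text, None) when no semicolon is present.
    Assumption: the first semicolon separates headword and example.
    """
    head, separator, example = sentence.partition(";")
    if not separator:
        return sentence.strip(), None
    return head.strip(), example.strip()


print(split_entry("nzɔ̄ ; nzɔ́ lā:h"))  # ('nzɔ̄', 'nzɔ́ lā:h')
print(split_entry("nʃwì"))             # ('nʃwì', None)
```

Rows without a semicolon may be either isolated words or full sentences, so the helper does not attempt to classify them.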