Diboum-ALCAM-MultimodalDataset

Description

Diboum-ALCAM-MultimodalDataset is a curated multimodal linguistic dataset dedicated to the documentation and technological enhancement of the Diboum variety of Basaa (ISO 639-3: bas), a Bantu language of Cameroon. Diboum is a localised and socially embedded speech form that is rarely represented in standard grammatical descriptions or lexicographical resources. The dataset comprises three closely aligned components: (i) a structured datasheet containing carefully selected example sentences that reflect casual usage in the Diboum variety, elicited rather than drawn from spontaneous speech; (ii) high-quality audio recordings of these sentences, produced by a native speaker; and (iii) an explicit audio–sentence mapping file enabling precise alignment between the textual and acoustic data.

The dataset's primary added value lies in its explicit focus on the Diboum variety of Basaa. Diboum is classified as a dialect of Basaa (bas) both by Ethnologue (https://www.ethnologue.com/language/bas/) and in the standard reference atlases of Cameroon's languages: the Atlas Linguistique du Cameroun by Breton and Bikia Fohtung (1991) and the Atlas Linguistique de l'Afrique Centrale: le Cameroun by Bibam Bikoi (2012). Like many other geographically and socially situated varieties of Basaa, Diboum typically remains invisible in reference grammars, dictionaries and educational materials, which often privilege more standardised or better-documented forms of the language. The dataset captures micro-variation in phonetics, phonology, morphosyntax and lexical choice, which is essential for understanding socially situated linguistic practices rather than a homogeneous, abstract system. In this sense, the dataset contributes to a more inclusive representation of linguistic diversity within the Basaa speech community.

From a methodological perspective, the dataset is designed to bridge the gap between language documentation and language technology. The parallel availability of text in the Diboum variety and in French, alongside aligned speech, makes the dataset suitable for a wide range of applications, including automatic speech recognition (ASR), text-to-speech (TTS), machine translation (MT), forced alignment, pronunciation modelling and multimodal language learning tools. At the same time, the structured datasheet supports linguistic analysis, contrastive studies with other language varieties and pedagogical uses in teacher training and language revitalisation contexts. More broadly, the Diboum-ALCAM-MultimodalDataset exemplifies an approach to African language resources that highlights fluidity, longitudinal variation, orality and community-based practice.

Language

Diboum is classified as a dialect of Basaa (ISO 639-3: bas) by Ethnologue (https://www.ethnologue.com/language/bas/) and in the two standard atlases of Cameroon's languages: the Atlas Linguistique du Cameroun (Breton and Bikia Fohtung 1991) and the Atlas Linguistique de l'Afrique Centrale: le Cameroun (Bibam Bikoi 2012). Basaa belongs to the Bantu branch of the Niger-Congo language family (Guthrie zone A.43). Basaa speakers are located primarily in the Littoral Region of Cameroon (notably the Nkam Division) and in the Centre Region (notably the Nyong-et-Kéllé Division). The Diboum variety is spoken in the Nkam Division of the Littoral Region.

Variants

At the time of publication of this dataset, the full scope of variation within Diboum has not been established. Diboum is itself considered part of the Basaa dialect continuum, and its relationship to other attested varieties of Basaa (e.g. Mbaa, Ndog-bikim, Hijuk) has not been systematically characterised in the available literature.

Writing System

The writing system used for the transcription of Diboum in this dataset is the International Phonetic Alphabet (IPA), as reflected in lexical entries (Word) and sentence-level examples (LangEx) in the datasheet. The phonological inventory described below is derived directly from the attested forms in the LangEx and Word columns of the datasheet.

1. Vowels

The vowel system attested in the dataset is as follows:

i, e, ɛ, a, ɔ, o, u, ə

2. Consonants

The consonant inventory reflected in the dataset includes the following simple, prenasalised, labialised and other consonants:

b, ɓ, by, c, d, dz, f, g, h, k, kw, l, m, mb, mv, n, nd, ng, ŋ, ŋg, p, r, s, t, v, w, y, z, ɲ, ɟ

These consonants appear consistently across noun stems, verbal forms, derivational patterns and noun-class alternations (e.g. ɲɔ̀ 'mouth', dìs 'eye', kíŋ 'head', hù 'ear', bì-kíŋ 'heads', mà-hù 'ears').

3. Tone system

The datasheet shows lexically and grammatically contrastive tones, marked directly on vowels and on the sonorant consonants m and n. The following tonal categories are attested in the LangEx column:

High tone (H): á, é, ɛ́, í, ó, ɔ́, ú, ə́, ń, ḿ
Low tone (L): à, è, ɛ̀, ì, ò, ɔ̀, ù, ə̀, ǹ, m̀
Falling contour tone (HL): â, ê, î, ô, ɔ̂, û
Rising contour tone (LH): ǎ, ě, ǐ, ǒ, ǔ
Mid/level tone: attested on a restricted set of items, marked with a macron (e.g. lāyū, mā-bē)

Unmarked vowels represent tonally neutral or contextually determined syllables.
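The tone categories above can be read off a transcription mechanically. The following is a minimal sketch, assuming that tone is encoded with the standard Unicode combining diacritics (acute, grave, circumflex, caron, macron) and that Unicode decomposition (NFD) separates each mark from its base letter. The function name tone_sequence and the label set are illustrative and are not part of the dataset.

```python
import unicodedata

# Assumed mapping from combining diacritics to the tone labels listed above.
TONE_MARKS = {
    "\u0301": "H",   # combining acute accent
    "\u0300": "L",   # combining grave accent
    "\u0302": "HL",  # combining circumflex accent
    "\u030C": "LH",  # combining caron
    "\u0304": "M",   # combining macron
}

def tone_sequence(text):
    """Return the tone labels found in text, in order of appearance.

    Unmarked vowels carry no diacritic and therefore yield no label.
    """
    decomposed = unicodedata.normalize("NFD", text)
    return [TONE_MARKS[ch] for ch in decomposed if ch in TONE_MARKS]

print(tone_sequence("bì-kíŋ"))  # ['L', 'H']
```

Normalising to NFD first means the same result is obtained whether the file stores precomposed characters (such as à) or a base letter followed by a combining mark (as is necessarily the case for ɔ̀ or m̀).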

4. Noun class system

The data reflects an active noun class system typical of Bantu languages, with prefixes marking singular/plural alternations (e.g. dì-sòŋ / mà-sòŋ 'ear of maize / ears of maize'; kíŋ / bì-kíŋ 'head / heads'; è-lím / bì-lím 'tongue / tongues'). The class prefixes attested in the dataset include: à-, bà-, bì-, bí-, by-, dì-, è-, mà-, mì-, m̀-, among others.
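For readers who want to group singular and plural forms automatically, the sketch below splits hyphen-segmented forms into prefix and stem. It assumes that the first hyphen marks the prefix boundary, as in the examples above, and that a form without a hyphen has no overt prefix. The helper names (split_prefix, group_by_stem) are hypothetical, and the prefix list is the non-exhaustive one given above.

```python
import unicodedata

def nfc(text):
    """Normalise to NFC so visually identical strings compare equal."""
    return unicodedata.normalize("NFC", text)

# Prefixes listed in the description; the source marks this list as partial.
ATTESTED_PREFIXES = {
    nfc(p) for p in ["à", "bà", "bì", "bí", "by", "dì", "è", "mà", "mì", "m̀"]
}

def split_prefix(form):
    """Return (prefix, stem); prefix is None when the form has no hyphen."""
    prefix, sep, stem = nfc(form).partition("-")
    if not sep:
        return None, prefix
    return prefix, stem

def group_by_stem(forms):
    """Map each stem to the list of prefixes it occurs with."""
    groups = {}
    for form in forms:
        prefix, stem = split_prefix(form)
        groups.setdefault(stem, []).append(prefix)
    return groups

groups = group_by_stem(["dì-sòŋ", "mà-sòŋ", "kíŋ", "bì-kíŋ", "è-lím", "bì-lím"])
for stem, prefixes in groups.items():
    print(stem, prefixes)
```

A prefix returned by split_prefix can be checked against ATTESTED_PREFIXES to flag forms whose segmentation deserves a manual look.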

Source

The dataset was collected through a questionnaire designed to gather basic information about the Diboum lexicon and grammar. This was done as part of the Atlas Linguistique du Cameroun (ALCAM) project.

Domain

The dataset consists of responses to a linguistic questionnaire designed to elicit basic lexicon and grammatical information.

Size

Total size is approximately 8.6 MB (uncompressed), comprising 8.52 MB of MP3 audio files and approximately 100 KB of TSV data files.

The total duration of the 337 audio recordings is 1068.9 seconds (17 minutes 49 seconds).
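As a quick consistency check, the figures above can be recomputed from one another. The snippet below uses only the numbers stated in this section; the mean clip duration is a derived value and is not reported in the dataset itself.

```python
# Figures taken from the Size section.
TOTAL_SECONDS = 1068.9
N_CLIPS = 337
AUDIO_MB = 8.52
TSV_MB = 0.1  # "approximately 100 KB"

# Convert the total duration to minutes and seconds.
minutes, seconds = divmod(TOTAL_SECONDS, 60)
print(f"{int(minutes)} min {round(seconds)} s")  # 17 min 49 s

# Derived value: average length of one recording.
mean_clip = TOTAL_SECONDS / N_CLIPS
print(f"mean clip duration: {mean_clip:.2f} s")  # mean clip duration: 3.17 s

# Audio plus TSV files should add up to the stated total.
total_mb = round(AUDIO_MB + TSV_MB, 1)
print(f"total size: {total_mb} MB")  # total size: 8.6 MB
```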

Structure

The dataset comprises: 1) a datasheet (Diboum-ALCAM-MultimodalDataset.tsv) with 375 lines and 20 columns; 2) 337 voice clips read by a single native speaker, stored as MP3 files in the tts_dataset subfolder; 3) a sentence-to-audio mapping file (mapping.tsv) with 337 lines and 4 columns.

Description of columns (datasheet)

#OrigID: original number of lexical entry on paper questionnaire
#EditID: modification of #OrigID
#FrenchRef: reference entry (originally provided in French)
#FrenchComm: original comments about reference entry (#FrenchRef)
#French: lexical entry in French (overlaps with #FrenchRef)
#Note: note of researcher on the lexical entry
#POS: part of speech
#Class: noun class (where applicable)
#Morf: morphological attribute (e.g. plural, singular)
#Var: (na)
#Word: lexical entry in Diboum
#CrossRef: cross-referencing of lexical entry number
#FrenchEx: example sentence in French
#LangEx: example sentence in Diboum
#LangExEdit: manual editing of #LangEx
#FrenchExEdit: edited French equivalent of #FrenchEx
#LangPars: word-for-word parsing in Diboum
#LangParsEdit: editing of #LangPars
#FrenchPars: French equivalent of #LangParsEdit
#FrenchParsEdit: editing of #FrenchPars
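The datasheet can be loaded with Python's standard csv module. The sketch below is written under two assumptions that should be checked against the actual file: that the first line is a header row, and that the header names match the column names listed above, with or without the leading "#". The function name load_datasheet is illustrative.

```python
import csv

# Column names as listed above, without the leading "#".
EXPECTED_COLUMNS = [
    "OrigID", "EditID", "FrenchRef", "FrenchComm", "French", "Note", "POS",
    "Class", "Morf", "Var", "Word", "CrossRef", "FrenchEx", "LangEx",
    "LangExEdit", "FrenchExEdit", "LangPars", "LangParsEdit", "FrenchPars",
    "FrenchParsEdit",
]

def load_datasheet(path):
    """Read the TSV datasheet and return a list of row dictionaries."""
    with open(path, encoding="utf-8", newline="") as handle:
        # QUOTE_NONE keeps quotation marks in the data as literal characters.
        reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
        header = [name.lstrip("#").strip() for name in next(reader)]
        missing = [name for name in EXPECTED_COLUMNS if name not in header]
        if missing:
            raise ValueError(f"missing columns: {missing}")
        return [dict(zip(header, row)) for row in reader]

# Example call (path as named in the Structure section):
# rows = load_datasheet("Diboum-ALCAM-MultimodalDataset.tsv")
# words = [row["Word"] for row in rows if row["Word"]]
```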

Description of columns (mapping file)

#audio_filename: name of the MP3 audio file
#key: MD5-based identifier shared with the audio filename (without extension)
#sentence: lexical item and/or example sentence read by the speaker
#attempts: number of recording attempts before the selected take
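Because the key is described as matching the audio filename without its extension, the mapping file can be checked against the audio folder. The following is a minimal sketch, assuming that mapping.tsv has a header row whose names are audio_filename, key, sentence and attempts (without the leading "#"); if the real header differs, the field names need adjusting. The function name check_mapping is hypothetical.

```python
import csv
from pathlib import Path

def check_mapping(mapping_path, audio_dir):
    """Return a list of human-readable problems found in the mapping file."""
    problems = []
    with open(mapping_path, encoding="utf-8", newline="") as handle:
        reader = csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
        # Data rows start on line 2 because line 1 is the header.
        for line_no, row in enumerate(reader, start=2):
            filename = row["audio_filename"]
            # The key should equal the filename without its extension.
            if Path(filename).stem != row["key"]:
                problems.append(f"line {line_no}: key does not match {filename}")
            # Every mapped clip should exist in the audio folder.
            if not (Path(audio_dir) / filename).is_file():
                problems.append(f"line {line_no}: missing audio file {filename}")
    return problems

# Example call (file and folder names as given in the Structure section):
# for problem in check_mapping("mapping.tsv", "tts_dataset"):
#     print(problem)
```

An empty list means every row has a consistent key and a matching MP3 file.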

Sample

audio file	words & sentences
53fa2dccd478e3d6519d580acc8e76e7.mp3	ɲɔ̀ ; à bí ɲɔ̀ sà
91ceb854c1c06bea68557601cbe1c4fe.mp3	mì-ɲɔ̀
72b1abf8f3504dde028abe3c312e6762.mp3	dìs ; bá béká báà tò mìs
1a140de194c2168aa188bb12b392227e.mp3	mìs
3975c30a65b70ff6fde1563a399585a8.mp3	m̀-ró ; à bí m-ró keŋ dì kíŋ rà
2badc3166db8b1509da93d2ec05f0750.mp3	hù ; mà-hù mé má ná lāyū
3c164ed5ffb0fac8eba9074e56b09da1.mp3	dì-sòŋ ; mà-sòŋ má m-byó máā
7ccd896bbcd26882efbdd66ddf24b69b.mp3	è-lím ; à kòk(ò) lá è-lím
40d7509cb8bf58f395dbd6e12f327751.mp3	mà-sòŋ
02240d7b41a7a175c20a0f854fbf4deb.mp3	mà-hù
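
In the sample above, an entry is either a lexical item alone or a lexical item followed by an example sentence, separated by " ; ". The sketch below separates the two parts and checks that a filename has the shape seen in the sample (32 lowercase hexadecimal characters plus .mp3). Both the separator convention and the filename pattern are inferred from the sample rather than documented, and the helper names are illustrative.

```python
import re

# Filename shape inferred from the sample rows.
AUDIO_NAME = re.compile(r"[0-9a-f]{32}\.mp3")

def split_entry(field):
    """Split 'word ; sentence' into (word, sentence); sentence may be None."""
    word, sep, sentence = field.partition(" ; ")
    return word.strip(), (sentence.strip() if sep else None)

def looks_like_audio_name(filename):
    """Return True if the filename matches the pattern seen in the sample."""
    return AUDIO_NAME.fullmatch(filename) is not None

sample = [
    ("72b1abf8f3504dde028abe3c312e6762.mp3", "dìs ; bá béká báà tò mìs"),
    ("1a140de194c2168aa188bb12b392227e.mp3", "mìs"),
]
for filename, field in sample:
    word, sentence = split_entry(field)
    print(looks_like_audio_name(filename), word, sentence)
```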