Batanga-ALCAM-MultimodalDataset

Description

Batanga-ALCAM-MultimodalDataset is a curated multimodal linguistic dataset for the documentation and technological support of the Batanga language (ISO 639-3: bnm). Batanga is a Bantu language spoken along the Atlantic coast of the South Region of Cameroon and is poorly represented in existing grammatical descriptions, computational resources and lexicographical tools.

The dataset is published in two successive releases: the present release covers the Banoho (banɔɔ) dialect; a companion datasheet for the Bapuku dialect will be integrated in a forthcoming release. The complete dataset will comprise three closely aligned components for each dialect: (i) a structured datasheet containing carefully selected example sentences and lexical entries reflecting attested usage in Batanga; (ii) high-quality audio recordings of these entries, produced by a native speaker; and (iii) an explicit audio–sentence mapping file enabling precise alignment between the textual and acoustic data.

The dataset's primary added value lies in its explicit focus on Batanga, a language that, like many other coastal Bantu languages of Cameroon, remains virtually absent from reference grammars, dictionaries, educational materials and language technology resources. The Banoho and Bapuku varieties display a range of phonological and morphosyntactic features characteristic of the Cameroonian coastal Bantu area, including a complex system of vowel contrasts, nasal vowels, and lexical tone. These features are essential for understanding the language's structural specificity and are rarely documented in machine-readable form. In this sense, the dataset contributes to a more inclusive and granular representation of African linguistic diversity.

From a methodological perspective, the dataset is designed to bridge the gap between language documentation and language technology. The parallel availability of text in Batanga and in French, alongside aligned speech, makes the dataset suitable for a wide range of applications, including automatic speech recognition (ASR), text-to-speech (TTS), machine translation (MT), forced alignment, pronunciation modelling and multimodal language learning tools. At the same time, the structured datasheet supports linguistic analysis, contrastive studies between the Banoho and Bapuku varieties, comparison with related coastal Bantu languages, and pedagogical uses in teacher training and language revitalisation contexts. More broadly, the Batanga-ALCAM-MultimodalDataset exemplifies an approach to African language resources that highlights fluidity, orality, phonological richness and community-based linguistic practice.

Language

Batanga (ISO 639-3: bnm) is a Bantu language of the Atlantic coast of Cameroon, spoken primarily in the South Region. It belongs to the wider group of coastal Bantu languages of the Cameroonian littoral area. According to Breton and Bikia Fohtung (1991), Batanga encompasses two main speech varieties: Banoho (banɔɔ) and Bapuku. The language is rarely represented in standard grammatical descriptions or computational resources.

Variants

As documented in the Atlas administratif des langues nationales du Cameroun (Breton and Bikia Fohtung 1991), the Batanga language encompasses two identified speech varieties: Banoho (banɔɔ) and Bapuku. These varieties are distributed along the Atlantic coastline of the South Region of Cameroon. At the time of the present release, the dataset includes resources for the Banoho dialect only; a datasheet for Bapuku will be incorporated in a forthcoming release. A systematic comparative description of the phonological and morphosyntactic variation across the two varieties is not yet available.

Writing System

The writing system used for the transcription of Batanga in this dataset is the International Phonetic Alphabet (IPA), as reflected in lexical entries (Word) and sentence-level examples (LangEx) in the datasheet. The phonological inventory described below is derived directly from the attested forms in the LangEx and Word columns of the Banoho datasheet.

1. Vowels

The vowel system attested in the dataset includes both oral and nasal vowels:

Oral vowels: i, e, ɛ, a, ɔ, o, u

Nasal vowels: ĩ, ẽ, ã, õ, ũ

These vowels occur with and without tone marking in lexical items and running text.

2. Consonants

The consonant inventory reflected in the dataset includes simple consonants, prenasalized consonants (mb, mp, mv, nd, ng) and the labial-velar stop kp:

b, d, f, g, h, k, kp, l, m, mb, mp, mv, n, nd, ng, ŋ, p, r, s, t, v, w, y, z, ɲ

These consonants appear consistently across noun stems, verbal forms, derivational patterns and noun-class alternations.
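
Because the inventory contains digraphs (kp, mb, mp, mv, nd, ng), splitting a transcribed form into segments requires trying the longer symbols first. The following is a minimal sketch of a greedy longest-match segmenter. The function name `segment` is hypothetical, and the sketch assumes that every sequence such as "m" followed by "b" is a single prenasalized consonant, which may not hold where a syllabic nasal precedes a stop.

```python
import unicodedata

# Consonant symbols exactly as listed in the inventory above.
CONSONANTS = ["b", "d", "f", "g", "h", "k", "kp", "l", "m", "mb", "mp", "mv",
              "n", "nd", "ng", "ŋ", "p", "r", "s", "t", "v", "w", "y", "z", "ɲ"]
# Longest symbols first, so that "mb" wins over "m" followed by "b".
_BY_LENGTH = sorted(CONSONANTS, key=len, reverse=True)


def segment(word):
    """Split a transcribed word into consonant symbols and other segments.

    Assumption: digraphs are always read as one consonant (greedy match).
    Combining marks (tone, nasality) stay attached to the preceding symbol.
    """
    text = unicodedata.normalize("NFD", word)
    segments = []
    i = 0
    while i < len(text):
        match = next((c for c in _BY_LENGTH if text.startswith(c, i)), None)
        # A non-consonant (typically a vowel) is taken one base character at a time.
        j = i + (len(match) if match else 1)
        # Absorb any combining diacritics that follow the base symbol.
        while j < len(text) and unicodedata.combining(text[j]):
            j += 1
        segments.append(unicodedata.normalize("NFC", text[i:j]))
        i = j
    return segments
```

Normalising to NFD first means the segmenter behaves the same whether the datasheet stores precomposed characters or base letters followed by combining marks.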

3. Tone system

The datasheet shows contrastive lexical and grammatical tone, marked directly on vowels with diacritics. The following tonal categories are attested in the LangEx and Word columns:

High tone (H): á, é, ɛ́, í, ó, ɔ́, ú
Low tone (L): à, è, ɛ̀, ì, ò, ɔ̀, ù
Falling contour tone (HL): â, ê, î, ô, ɔ̂, û
Rising contour tone (LH): ǎ, ě, ǐ, ǒ, ɔ̌, ǔ
Mid tone (M): ā, ē, ō, ɔ̄

Vowels without a tone mark represent syllables that are tonally neutral or whose tone is determined by context. Nasal vowels also carry tone distinctions.
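
The tone categories above correspond to standard Unicode combining diacritics, so they can be read off a transcription after canonical decomposition. The following is a minimal sketch, not a tool shipped with the dataset: the function name `describe_vowels` is hypothetical, and the mapping from diacritic to tone label simply mirrors the list above.

```python
import unicodedata
from typing import List, Optional, Tuple

# Combining diacritics and the tone labels used in the list above.
TONE_MARKS = {
    "\u0301": "H",   # acute accent
    "\u0300": "L",   # grave accent
    "\u0302": "HL",  # circumflex
    "\u030C": "LH",  # caron
    "\u0304": "M",   # macron
}
NASAL = "\u0303"     # combining tilde marks a nasal vowel
VOWELS = set("ieɛaɔou")


def describe_vowels(text: str) -> List[Tuple[str, Optional[str], bool]]:
    """Return (base vowel, tone label or None, is_nasal) for each vowel."""
    decomposed = unicodedata.normalize("NFD", text)
    result = []
    i = 0
    while i < len(decomposed):
        ch = decomposed[i]
        i += 1
        if ch not in VOWELS:
            continue
        tone, nasal = None, False
        # Inspect the combining marks attached to this vowel.
        while i < len(decomposed) and unicodedata.combining(decomposed[i]):
            mark = decomposed[i]
            if mark in TONE_MARKS:
                tone = TONE_MARKS[mark]
            elif mark == NASAL:
                nasal = True
            i += 1
        result.append((ch, tone, nasal))
    return result
```

A tone label of `None` corresponds to a vowel written without a tone mark. A function of this kind could be used, for instance, to count tone categories across the Word and LangEx columns.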

Source

The data were collected with a questionnaire designed to gather basic information about the Batanga lexicon and grammar, as part of the Atlas Linguistique du Cameroun (ALCAM) project.

Domain

The dataset covers basic lexicon and grammatical information elicited through a linguistic questionnaire.

Size

To be completed upon finalisation of both dialect releases.

Structure

The current release comprises resources for the Banoho (banɔɔ) dialect:

1) a datasheet (ALCAM_dataset_Batanga_Banoho.tsv) with lexical entries and example sentences;
2) voice clips read by a native speaker of Banoho;
3) a sentence-to-audio mapping file (mapping_Banoho.tsv).

Upon completion, the full dataset will additionally include:

4) a datasheet for the Bapuku dialect (ALCAM_dataset_Batanga_Bapuku.tsv);
5) audio recordings for Bapuku;
6) a sentence-to-audio mapping file for Bapuku (mapping_Bapuku.tsv).
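
Both the datasheet and the mapping file are tab-separated, so they can be read with the Python standard library. The following is a minimal sketch under stated assumptions: the helper name `read_tsv` is hypothetical, the files are assumed to be UTF-8 with a header row, and header names may or may not carry a leading "#". The columns of the mapping file are not described here, so the sketch makes no assumption about them.

```python
import csv
from pathlib import Path


def read_tsv(path):
    """Read a tab-separated file into a list of dicts keyed by column name.

    Assumptions: UTF-8 encoding and a header row. A leading "#" on header
    names is dropped so that "#Word" and "Word" are treated alike.
    """
    with Path(path).open(encoding="utf-8", newline="") as handle:
        # QUOTE_NONE keeps quotation marks in the transcriptions as literal text.
        reader = csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
        return [
            {(key or "").lstrip("#"): value for key, value in row.items()}
            for row in reader
        ]


# Example usage with the file names given above (files must be present locally):
# entries = read_tsv("ALCAM_dataset_Batanga_Banoho.tsv")
# mapping = read_tsv("mapping_Banoho.tsv")
```

Disabling quote handling is a deliberate choice for linguistic data, where a stray quotation mark inside a cell should not change how the row is split.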

Description of columns

#OrigID: original number of the lexical entry on the paper questionnaire
#EditID: edited version of #OrigID
#FrenchRef: reference entry (originally provided in French)
#FrenchComm: original comments about reference entry (#FrenchRef)
#French: lexical entry in French (overlaps with #FrenchRef)
#Note: note of researcher on the lexical entry
#POS: part of speech
#Class: noun class (where applicable)
#Morf: morphological attribute (e.g. plural, singular)
#Var: dialect variant tag (Banoho / Bapuku)
#Word: lexical entry in Batanga (IPA)
#CrossRef: cross-referencing of lexical entry number
#FrenchEx: example sentence in French
#LangEx: example sentence in Batanga (IPA)
#LangExEdit: manually edited version of #LangEx
#FrenchExEdit: edited version of #FrenchEx
#LangPars: word-for-word parsing in Batanga
#LangParsEdit: edited version of #LangPars
#FrenchPars: French equivalent of #LangParsEdit
#FrenchParsEdit: edited version of #FrenchPars
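
Several columns come in original/edited pairs (for example #LangEx and #LangExEdit). The following is a minimal sketch of one way to consume them, not a documented convention of the dataset. It assumes that a non-empty edited value should take precedence over the original, and that rows are dicts keyed by column name without the leading "#". The function names `preferred` and `sentence_pairs` are hypothetical.

```python
def preferred(row, base, suffix="Edit"):
    """Return the edited value of a column if non-empty, else the original.

    Assumption: the "...Edit" column supersedes its base column when filled.
    """
    edited = (row.get(base + suffix) or "").strip()
    original = (row.get(base) or "").strip()
    return edited or original


def sentence_pairs(rows):
    """Yield (Batanga, French) example sentences, skipping incomplete rows."""
    for row in rows:
        batanga = preferred(row, "LangEx")
        french = preferred(row, "FrenchEx")
        if batanga and french:
            yield batanga, french
```

Pairs produced this way could serve as parallel text for the machine translation use case mentioned in the description, provided the precedence assumption matches how the edited columns were actually filled in.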

Sample

To be completed upon release.