License:
NOODL-1.0
Steward:
Institute of African Digital Humanities
Dataset ID:
cmq9htcx402gbmk07g00d1u67
Task: NLP
Release Date: 6/11/2026
Format: MP3, TSV
Size: 11.67 MB
Batanga-ALCAM-MultimodalDataset is a richly curated, multimodal linguistic dataset dedicated to the documentation and technological enhancement of the Batanga language (ISO 639-3: bnm). Batanga is a Bantu language spoken along the Atlantic coast of the South Region of Cameroon and is rarely represented in existing grammatical descriptions, computational resources or lexicographical tools.
The dataset is published in two successive releases: the present release covers the Banoho (banɔɔ) dialect; a companion datasheet for the Bapuku dialect will be integrated in a forthcoming release. The complete dataset will comprise three closely aligned components for each dialect: (i) a structured datasheet containing carefully selected example sentences and lexical entries reflecting attested usage in Batanga; (ii) high-quality audio recordings of these entries, produced by a native speaker; and (iii) an explicit audio–sentence mapping file enabling precise alignment between the textual and acoustic data.
The dataset's primary added value lies in its explicit focus on Batanga, a language that, like many other coastal Bantu languages of Cameroon, remains virtually absent from reference grammars, dictionaries, educational materials and language technology resources. The Banoho and Bapuku varieties display a range of phonological and morphosyntactic features characteristic of the Cameroonian coastal Bantu area, including a complex system of vowel contrasts, nasal vowels, and lexical tone, all of which are essential for understanding the language's structural specificity and are rarely documented in machine-readable form. In this sense, the dataset contributes to a more inclusive and granular representation of African linguistic diversity.
From a methodological perspective, the dataset is designed to bridge the gap between language documentation and language technology. The parallel availability of text in Batanga and in French, alongside aligned speech, makes the dataset suitable for a wide range of applications, including automatic speech recognition (ASR), text-to-speech (TTS), machine translation (MT), forced alignment, pronunciation modelling and multimodal language learning tools. At the same time, the structured datasheet supports linguistic analysis, contrastive studies between the Banoho and Bapuku varieties, comparison with related coastal Bantu languages, and pedagogical uses in teacher training and language revitalisation contexts. More broadly, the Batanga-ALCAM-MultimodalDataset exemplifies an approach to African language resources that highlights fluidity, orality, phonological richness and community-based linguistic practice.
Licensing
Nwulite Obodo Open Data Licence 1.0 (NOODL-1.0)
https://licensingafricandatasets.com/nwulite-obodo-license
Restrictions/Special Constraints
By downloading this dataset, you agree:
- to use it for research and scientific purposes only;
- not to re-host or re-share this dataset.
Forbidden Usage
You agree not to use the data for: determining the identity of the speaker(s) in the dataset; attempting to clone the voice of, or train models that imitate, the speaker(s) in this dataset; generative AI; reproduction; duplication; modification; augmentation; copying; distribution; transmission; display; sale; transfer; publication or creation of derivative works without the explicit permission of the legal owner of the dataset.
Intended Use
(a) Speech-related tasks:
- Automatic speech recognition (ASR): Audio–text alignment allows the evaluation of speech recognition models for Batanga. Note that the sentences are transcribed in the IPA. There is currently no widely adopted standardised orthography for Batanga; the General Alphabet of Cameroon's Languages (GACEL) provides a reference framework but has not been systematically applied to this language.
- Text-to-speech (TTS): As the dataset contains clean sentence–audio pairs, it can also be used to evaluate speech synthesis or text-to-speech models. The use of IPA transcription rather than a conventional orthography should be taken into account when designing TTS experiments.
- Speech–text alignment/forced alignment benchmarking: Fine-grained audio–text pairing provides ground truth for evaluating phoneme- or word-level aligners adapted to tonal and phonologically complex African languages.
(b) Translation and multilingual tasks:
- Machine translation (Batanga ↔ French): The sentence-level alignment between Batanga and French makes the dataset a parallel corpus for evaluating translation models, with the caveat that the Batanga side is an IPA transcription rather than text in a conventional writing system.
- Speech translation (speech-to-text)
(c) Linguistic and lexicographic tasks:
- Morphological analysis/glossed corpus studies: The morpheme-level glosses are valuable for computational morphology, interlinear text modelling (ILTs) and grammar induction tasks, particularly for coastal Bantu languages of Cameroon.
- Lexicon and part-of-speech tagging: Useful for building linguistic resources such as dictionaries, morphological analysers or POS taggers for Batanga and related coastal Bantu languages of the South Region of Cameroon.
- Dialect comparison: The parallel structure of the Banoho and Bapuku datasheets (once both are released) will facilitate systematic phonological, lexical and morphosyntactic comparison between the two varieties of Batanga, supporting variationist and typological studies of the coastal Bantu area.
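For the ASR evaluation use mentioned above, a character error rate (CER) is a common starting point. The sketch below is one possible way to compute it for IPA transcriptions, not a metric prescribed by the dataset. It assumes that both strings are first decomposed with Unicode NFD, so that a tone or nasal diacritic counts as its own symbol and precomposed and decomposed spellings of the same vowel compare as equal. The function names `edit_distance` and `cer` are illustrative.

```python
import unicodedata


def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (insert, delete, substitute)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(reference, hypothesis):
    """Character error rate over NFD code points (assumption: diacritics count as symbols)."""
    ref = unicodedata.normalize("NFD", reference)
    hyp = unicodedata.normalize("NFD", hypothesis)
    if not ref:
        raise ValueError("reference must not be empty")
    return edit_distance(ref, hyp) / len(ref)
```

With this convention, a hypothesis that gets the vowel right but drops its tone mark is penalised for one symbol rather than for the whole vowel. Whether that weighting is appropriate depends on the experiment.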
Batanga (ISO 639-3: bnm) is a Bantu language of the Atlantic coast of Cameroon, spoken primarily in the South Region. It belongs to the wider group of coastal Bantu languages of the Cameroonian littoral area. According to Breton and Bikia Fohtung (1991), Batanga encompasses two main speech varieties: Banoho (banɔɔ) and Bapuku. The language is rarely represented in standard grammatical descriptions or computational resources.
As documented in the Administrative Atlas of Cameroon (Breton and Bikia Fohtung 1991), the Batanga language encompasses two identified speech varieties: Banoho (banɔɔ) and Bapuku. These varieties are geographically distributed along the Atlantic coastline of the South Region of Cameroon. At the time of the present release, the dataset includes resources for the Banoho dialect only; a datasheet for Bapuku will be incorporated in a forthcoming release. A full systematic comparative description of the phonological and morphosyntactic variation across the two varieties is not yet available.
The writing system used for the transcription of Batanga in this dataset is the International Phonetic Alphabet (IPA), as reflected in lexical entries (Word) and sentence-level examples (LangEx) in the datasheet. The phonological inventory described below is derived directly from the attested forms in the LangEx and Word columns of the Banoho datasheet.
The vowel system attested in the dataset includes both oral and nasal vowels:
Oral vowels: i, e, ɛ, a, ɔ, o, u
Nasal vowels: ĩ, ẽ, ã, õ, ũ
These vowels occur with and without tone marking in lexical items and running text.
The consonant inventory reflected in the dataset includes simple, prenasalized, and affricate consonants:
b, d, f, g, h, k, kp, l, m, mb, mp, mv, n, nd, ng, ŋ, p, r, s, t, v, w, y, z, ɲ
These consonants appear consistently across noun stems, verbal forms, derivational patterns and noun-class alternations.
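The inventory above can be used to split a transcription into segments, which is helpful when counting phonemes or checking a file for unexpected symbols. The following is a minimal sketch, assuming that a digraph listed in the inventory (kp, mb, mp, mv, nd, ng) should always be read as one unit and that combining diacritics belong to the preceding segment. Both are simplifying assumptions, since a sequence such as n + g could in principle span a morpheme boundary. The function name `segment` is hypothetical.

```python
import unicodedata

# Inventory as listed in this datasheet description.
VOWELS = ["i", "e", "ɛ", "a", "ɔ", "o", "u"]
CONSONANTS = ["b", "d", "f", "g", "h", "k", "kp", "l", "m", "mb", "mp", "mv",
              "n", "nd", "ng", "ŋ", "p", "r", "s", "t", "v", "w", "y", "z", "ɲ"]

# Longest units first, so that "mb" is tried before "m".
UNITS = sorted(VOWELS + CONSONANTS, key=len, reverse=True)


def segment(word):
    """Greedy longest-match segmentation; symbols outside the inventory are kept as-is."""
    text = unicodedata.normalize("NFD", word)
    segments = []
    i = 0
    while i < len(text):
        ch = text[i]
        if unicodedata.combining(ch):
            # Tone or nasal mark: attach it to the segment it follows.
            if segments:
                segments[-1] += ch
            else:
                segments.append(ch)
            i += 1
            continue
        for unit in UNITS:
            if text.startswith(unit, i):
                segments.append(unit)
                i += len(unit)
                break
        else:
            segments.append(ch)  # not in the documented inventory
            i += 1
    return [unicodedata.normalize("NFC", s) for s in segments]
```

For instance, the arbitrary test string "mbá" (not a dataset entry) is split into the prenasalized consonant "mb" followed by "á".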
The datasheet shows lexical and grammatical contrastive tones, marked directly on vowels. The following tonal categories are attested in the LangEx and Word columns:
High tone (H): á, é, ɛ́, í, ó, ɔ́, ú
Low tone (L): à, è, ɛ̀, ì, ò, ɔ̀, ù
Falling contour tone (HL): â, ê, î, ô, ɔ̂, û
Rising contour tone (LH): ǎ, ě, ǐ, ǒ, ɔ̌, ǔ
Mid tone (M): ā, ē, ō, ɔ̄
Unmarked vowels represent tonally neutral or contextually determined syllables. Nasal vowels also carry tone distinctions.
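Because tone is written with diacritics on the vowel, the tonal categories listed above can be read off a transcription by Unicode decomposition. The sketch below assumes the datasheet encodes tones with the standard combining acute, grave, circumflex, caron and macron, and nasality with the combining tilde, whether precomposed or not. This should be checked against the actual file. The names `tone_sequence` and `strip_tones` are illustrative.

```python
import unicodedata

# Combining marks mapped to the tone labels used in this description.
TONE_MARKS = {
    "\u0301": "H",   # acute
    "\u0300": "L",   # grave
    "\u0302": "HL",  # circumflex (falling)
    "\u030C": "LH",  # caron (rising)
    "\u0304": "M",   # macron
}
NASAL_MARK = "\u0303"  # tilde


def tone_sequence(text):
    """Return the tone labels found in text, in order of appearance."""
    decomposed = unicodedata.normalize("NFD", text)
    return [TONE_MARKS[ch] for ch in decomposed if ch in TONE_MARKS]


def strip_tones(text, keep_nasal=True):
    """Remove tone marks, optionally removing nasalisation as well."""
    decomposed = unicodedata.normalize("NFD", text)
    kept = [ch for ch in decomposed
            if ch not in TONE_MARKS and (keep_nasal or ch != NASAL_MARK)]
    return unicodedata.normalize("NFC", "".join(kept))
```

A tone-stripped copy of the text can be useful for lookups that should ignore tone, while `tone_sequence` gives a quick view of the tonal melody of a word or sentence.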
The dataset was collected through a questionnaire designed to gather basic information about the Batanga lexicon and grammar. This was done as part of the Atlas Linguistique du Cameroun (ALCAM) project.
The dataset represents a linguistic questionnaire designed to elicit the basic lexicon and grammatical information.
To be completed upon finalisation of both dialect releases.
The current release comprises resources for the Banoho (banɔɔ) dialect:
1) a datasheet (ALCAM_dataset_Batanga_Banoho.tsv) with lexical entries and example sentences;
2) voice clips read by a native speaker of Banoho;
3) a sentence-to-audio mapping file (mapping_Banoho.tsv).
Upon completion, the full dataset will additionally include:
4) a datasheet for the Bapuku dialect (ALCAM_dataset_Batanga_Bapuku.tsv);
5) audio recordings for Bapuku;
6) a sentence-to-audio mapping file for Bapuku (mapping_Bapuku.tsv).
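The mapping file can be used to check that every referenced clip is present locally before running experiments. The column layout of mapping_Banoho.tsv is not described here, so the sketch below takes the two column names as arguments rather than guessing them. The function names `read_mapping` and `missing_clips` are hypothetical, and the file is assumed to be tab-separated with a header row.

```python
import csv
from pathlib import Path


def read_mapping(handle, id_col, audio_col):
    """Read a tab-separated mapping from an open file; return {entry id: audio file name}."""
    reader = csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    fields = reader.fieldnames or []
    for col in (id_col, audio_col):
        if col not in fields:
            raise KeyError(f"column {col!r} not found; header is {fields}")
    return {row[id_col]: row[audio_col]
            for row in reader
            if row.get(id_col) and row.get(audio_col)}


def missing_clips(mapping, audio_dir):
    """List audio file names from the mapping that do not exist in audio_dir."""
    audio_dir = Path(audio_dir)
    return sorted(name for name in mapping.values()
                  if not (audio_dir / name).is_file())
```

A typical call would open mapping_Banoho.tsv with `encoding="utf-8"`, inspect the header once to find the relevant columns, and then pass those names to `read_mapping`.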
#OrigID: original number of lexical entry on paper questionnaire
#EditID: modification of #OrigID
#FrenchRef: reference entry (originally provided in French)
#FrenchComm: original comments about reference entry (#FrenchRef)
#French: lexical entry in French (overlaps with #FrenchRef)
#Note: note of researcher on the lexical entry
#POS: part of speech
#Class: noun class (where applicable)
#Morf: morphological attribute (e.g. plural, singular)
#Var: dialect variant tag (Banoho / Bapuku)
#Word: lexical entry in Batanga (IPA)
#CrossRef: cross-referencing of lexical entry number
#FrenchEx: example sentence in French
#LangEx: example sentence in Batanga (IPA)
#LangExEdit: manual editing of #LangEx
#FrenchExEdit: edited French equivalent of #FrenchEx
#LangPars: word-for-word parsing in Batanga
#LangParsEdit: editing of #LangPars
#FrenchPars: French equivalent of #LangParsEdit
#FrenchParsEdit: editing of #FrenchPars
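The fields above can be read with a standard TSV parser. The following sketch extracts Batanga–French sentence pairs from the datasheet under two assumptions that should be verified against the file: the header row may or may not carry the leading "#" shown in this list (the code strips it either way), and an edited column (LangExEdit, FrenchExEdit) is preferred over its unedited counterpart when it is filled in. The function names are illustrative.

```python
import csv


def read_datasheet(handle):
    """Read the tab-separated datasheet from an open file; return a list of row dicts."""
    reader = csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    rows = []
    for raw in reader:
        rows.append({key.lstrip("#").strip(): (value or "").strip()
                     for key, value in raw.items()
                     if key is not None})  # ignore overflow cells beyond the header
    return rows


def parallel_pairs(rows):
    """Return (Batanga, French) example sentence pairs, preferring edited columns."""
    pairs = []
    for row in rows:
        batanga = row.get("LangExEdit") or row.get("LangEx")
        french = row.get("FrenchExEdit") or row.get("FrenchEx")
        if batanga and french:
            pairs.append((batanga, french))
    return pairs
```

Rows without an example sentence on either side are skipped, so lexical entries that have only a Word value do not produce empty pairs.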
To be completed upon release.