License: NOODL-1.0
Steward: Institute of African Digital Humanities
Dataset ID: cmqf6aecs066smk07udkk7ocn
Task: TTS
Release Date: 6/15/2026
Format: MP3, TSV
Size: 79.13 MB
Batanga-TTS-Dataset is a scripted speech dataset dedicated to the documentation and technological development of Batanga (ISO 639-3: bnm), a Bantu language spoken along the Atlantic coast of the Ocean Division, South Region of Cameroon. The dataset was compiled in the framework of the Mozilla Data Collective initiative (2026), as a supplement to the Common Voice Scripted Speech 25.0 – Batanga dataset (https://mozilladatacollective.com/datasets/cmn2ca0qt01fomm07ivn5e89r) and the Tupuri-ALCAM-MultimodalDataset (https://mozilladatacollective.com/datasets/cmq9htcx402gbmk07g00d1u67).
The dataset comprises 1,023 high-quality MP3 audio recordings of Batanga sentences read by a native speaker across 11 recording sessions, together with per-session sentence-to-audio mapping files enabling precise alignment between textual and acoustic data. Sentences were drawn from a scripted speech prompt list and read in a controlled environment. The dataset represents both principal speech varieties of Batanga: Bapuku and Banoho (also spelt Banoo or Banɔɔ).
The transcription of all sentences follows the General Alphabet of Cameroon's Languages (AGLC; French acronym: Alphabet Général des Langues Camerounaises), the reference standard for Cameroonian national languages. The Batanga orthography employed in this dataset is distinguished by an extended vowel inventory (including the open-mid front unrounded vowel ɛ and the open-mid back rounded vowel ɔ), a tone-marking system using acute and grave diacritics applied to vowels, nasal-consonant boundary markers (nʼ, ŋʼ) separating nasal prefixes from consonant clusters, and the eng symbol (ŋ) for the velar nasal.
The parallel availability of AGLC-transcribed text and aligned speech makes the dataset suitable for a wide range of applications, including text-to-speech (TTS) synthesis, automatic speech recognition (ASR), forced alignment, pronunciation modelling, and language learning tools. It also directly supports efforts to standardise and normalise the digital representation of Batanga in language technology contexts.
Licensing
Nwulite Obodo Open Data Licence 1.0 (NOODL-1.0)
https://licensingafricandatasets.com/nwulite-obodo-license
Restrictions/Special Constraints
By downloading this dataset, you agree:
- To use it for research and scientific use only
- That you will not re-host or re-share this dataset
Forbidden Usage
You agree not to use the data for: determining the identity of any speaker in the dataset; attempting to clone any voice or train models that imitate any speaker in this dataset; Generative AI; reproduction; duplication; modification; augmentation; copying; distribution; transmission; display; sale; transfer; publication or creation of derivative works without the explicit permission of the legal owner of the dataset.
Intended Use
(a) Speech-related tasks:
- Text-to-speech (TTS) synthesis: The dataset provides clean sentence–audio pairs from multiple recording sessions and is directly suited for training, fine-tuning, and evaluating speech synthesis models for Batanga. The availability of AGLC-transcribed sentences with aligned audio enables the development of TTS systems capable of producing natural-sounding Batanga speech across both the Bapuku and Banoho varieties.
- Automatic speech recognition (ASR): Audio–text alignment enables the training and evaluation of speech recognition models for Batanga. The per-session structure and controlled recording conditions make the dataset suitable for building and evaluating ASR models for this under-resourced coastal Bantu language.
- Speech–text alignment / forced alignment benchmarking: Fine-grained audio–text pairing provides ground truth for evaluating phoneme- or word-level aligners adapted to coastal Bantu languages of the South Region of Cameroon.
- Pronunciation modelling: The AGLC-transcribed sentences, combined with aligned audio, provide a resource for developing grapheme-to-phoneme (G2P) models and pronunciation lexicons for Batanga.
(b) Linguistic and lexicographic tasks:
- Phonological analysis: The dataset enables systematic study of the phonological and tonal system of Batanga, including its lexical tone system, extended vowel inventory (ɛ, ɔ), prenasalised consonant inventory, nasal prefix boundary marking (nʼ, ŋʼ), and the phonological contrasts operative across the Bapuku and Banoho varieties.
- Orthographic standardisation and normalisation: The dataset can serve as a reference corpus for evaluating and training text normalisation models aligned with the AGLC standard for Batanga, complementing the IPA-transcribed data available in the Batanga-ALCAM-MultimodalDataset.
- Dialect comparison: Combined with the Batanga-ALCAM-MultimodalDataset, the present resource supports systematic phonological, lexical and morphosyntactic comparison between the Bapuku and Banoho varieties of Batanga, as well as broader variationist and typological studies of the coastal Bantu area of Cameroon.
- Language documentation: The dataset contributes to the digital documentation of Batanga scripted speech in AGLC orthography, supporting efforts to extend the digital presence of this coastal Bantu language of the South Region of Cameroon.
Batanga (ISO 639-3: bnm) is a Bantu language spoken along the Atlantic coast of Cameroon, primarily in the Ocean Division of the South Region. It belongs to the wider group of coastal Bantu languages of the Cameroonian littoral area. Despite its sociolinguistic significance within Cameroon, Batanga remains substantially underrepresented in language technology resources.
According to the Administrative Atlas of Cameroon's Languages (Breton & Bikia Fohtung 1991), Batanga encompasses two main speech varieties:
Banoho (also spelt Banoo or Banɔɔ)
Bapuku
The present dataset represents both varieties.
The writing system used for the transcription of Batanga in this dataset is the General Alphabet of Cameroon's Languages (AGLC). The AGLC provides a phonologically motivated orthographic standard for Cameroonian national languages and serves as the reference framework for Batanga literacy materials. Note that the Batanga-ALCAM-MultimodalDataset, a companion resource, employs the International Phonetic Alphabet (IPA); the present TTS dataset uses AGLC throughout.
The vowel system attested in the dataset includes the following oral vowels:
a, e, i, o, u, ɛ, ɔ
Where:
ɛ (epsilon): open-mid front unrounded vowel
ɔ (open-o): open-mid back rounded vowel
Long vowels are represented by vowel doubling where attested.
The consonant inventory reflected in the dataset includes simple, digraph, and prenasalised consonants:
b, d, f, g, h, j, k, l, m, mb, mp, mv, n, nd, nj, ng, ŋ, p, r, s, t, v, w, y
Special symbols:
ŋ (eng): velar nasal consonant
ny: palatal nasal (digraph), also found in sequences such as nyw (e.g., nywɛ)
Batanga attests nasal prefixes that attach to consonant clusters. These are represented in the dataset using the apostrophe as a boundary marker between the nasal and the following consonant:
nʼ: alveolar nasal prefix before consonants (e.g., nʼtindi, nʼkuta, nʼlesi, nʼtamwi)
ŋʼ: velar nasal prefix before consonants (e.g., ŋʼhɔbɛ)
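Since the orthography mixes single letters, digraphs, and nasal-plus-boundary-marker units, a grapheme-to-phoneme front end needs to segment words into these units first. The following is a minimal sketch of a greedy longest-match segmenter, not a validated tokenizer for Batanga. It assumes the boundary marker is the modifier letter apostrophe (U+02BC) as it appears in this card, and that every letter pair listed in the inventory is a single unit wherever it occurs; the function name `graphemes` is illustrative.

```python
import unicodedata

BOUNDARY = "\u02bc"  # modifier letter apostrophe (assumed boundary marker)

# Two-character units: nasal prefixes with boundary marker, then digraphs.
MULTI = {
    "n" + BOUNDARY, "\u014b" + BOUNDARY,  # nʼ, ŋʼ
    "mb", "mp", "mv", "nd", "nj", "ng", "ny",
}

def graphemes(word):
    """Split a word into orthographic units, keeping tone marks on their vowel."""
    text = unicodedata.normalize("NFD", word.lower())
    units = []
    i = 0
    while i < len(text):
        ch = text[i]
        if unicodedata.combining(ch) and units:
            units[-1] += ch           # attach a tone diacritic to the previous unit
            i += 1
        elif text[i:i + 2] in MULTI:
            units.append(text[i:i + 2])  # greedy match on two-character units
            i += 2
        else:
            units.append(ch)
            i += 1
    return units

print(graphemes("n\u02bctindi"))
print(graphemes("mongolo"))
```

Treating every `nd` or `ng` sequence as one unit is a simplifying assumption; a real segmenter would need syllable-structure information to decide ambiguous cases.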
Batanga is a tonal language with contrastive pitch levels. The dataset employs systematic tone marking on vowels in accordance with the AGLC convention. The following diacritics are attested in the dataset:
Level tones:
High tone (H): acute accent — á, é, í, ó, ú, ɛ́, ɔ́
Low tone (L): grave accent — à, è, ì, ò, ù, ɛ̀, ɔ̀
Unmarked vowels represent tonally neutral or contextually determined syllables.
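Because ɛ and ɔ have no precomposed accented forms in Unicode, tone-marked text is easiest to process after canonical decomposition (NFD), where every tone mark is a separate combining character. The sketch below reads off a tone pattern per vowel under that approach; the function name `tone_pattern` and the `H`/`L`/`0` labels are illustrative choices, and the vowel set is the one listed in this card.

```python
import unicodedata

ACUTE = "\u0301"  # combining acute accent: high tone (H)
GRAVE = "\u0300"  # combining grave accent: low tone (L)
VOWELS = set("aeiou\u025b\u0254")  # a, e, i, o, u, ɛ, ɔ

def tone_pattern(text):
    """Return one label per vowel: 'H', 'L', or '0' for an unmarked vowel."""
    decomposed = unicodedata.normalize("NFD", text.lower())
    pattern = []
    prev_is_vowel = False
    for ch in decomposed:
        if ch in VOWELS:
            pattern.append("0")
            prev_is_vowel = True
        elif ch == ACUTE and prev_is_vowel:
            pattern[-1] = "H"
        elif ch == GRAVE and prev_is_vowel:
            pattern[-1] = "L"
        elif not unicodedata.combining(ch):
            prev_is_vowel = False  # a consonant, space, or punctuation mark
    return "".join(pattern)

print(tone_pattern("yámɛ"))      # word taken from the sample sentences
print(tone_pattern("yɛ́hɛ́pi"))
```

The same decomposition makes it straightforward to strip tone marks for a toneless text view, by dropping the two combining characters before recomposing with NFC.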
The dataset was compiled from scripted speech prompt lists read by a native speaker. Sentences were selected to provide broad phonological coverage of Batanga and to represent both the Bapuku and Banoho speech varieties. Sentences were transcribed in accordance with the AGLC orthographic standard.
The dataset represents scripted speech in Batanga, covering a broad range of everyday sentence types drawn from a general-purpose TTS/ASR prompt list. All utterances are scripted rather than spontaneous.
Total audio duration: 5,231 seconds (01h 27m 11s), distributed across 1,023 MP3 audio clips in 11 recording sessions.
The dataset is organised into 11 recording sessions:
Session tts_dataset_bnm_01: 100 clips (11m 25s)
Session tts_dataset_bnm_02: 100 clips (8m 46s)
Session tts_dataset_bnm_03: 100 clips (8m 03s)
Session tts_dataset_bnm_04: 100 clips (6m 54s)
Session tts_dataset_bnm_05: 100 clips (8m 28s)
Session tts_dataset_bnm_06: 99 clips (8m 21s)
Session tts_dataset_bnm_07: 100 clips (8m 29s)
Session tts_dataset_bnm_08: 100 clips (7m 54s)
Session tts_dataset_bnm_09: 100 clips (7m 36s)
Session tts_dataset_bnm_10: 100 clips (8m 48s)
Session tts_dataset_bnm_11: 24 clips (2m 22s)
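As a quick consistency check, the per-session clip counts and durations can be summed. The sketch below copies the values from the session list above; the `to_seconds` helper is illustrative and not part of the dataset.

```python
# Per-session (clip count, duration) values copied from the session list.
SESSIONS = {
    "tts_dataset_bnm_01": (100, "11m 25s"),
    "tts_dataset_bnm_02": (100, "8m 46s"),
    "tts_dataset_bnm_03": (100, "8m 03s"),
    "tts_dataset_bnm_04": (100, "6m 54s"),
    "tts_dataset_bnm_05": (100, "8m 28s"),
    "tts_dataset_bnm_06": (99, "8m 21s"),
    "tts_dataset_bnm_07": (100, "8m 29s"),
    "tts_dataset_bnm_08": (100, "7m 54s"),
    "tts_dataset_bnm_09": (100, "7m 36s"),
    "tts_dataset_bnm_10": (100, "8m 48s"),
    "tts_dataset_bnm_11": (24, "2m 22s"),
}

def to_seconds(duration):
    """Convert a string such as '8m 03s' to a number of seconds."""
    minutes, seconds = duration.replace("m", "").replace("s", "").split()
    return int(minutes) * 60 + int(seconds)

total_clips = sum(clips for clips, _ in SESSIONS.values())
total_seconds = sum(to_seconds(d) for _, d in SESSIONS.values())

print(total_clips)
print(total_seconds)
```

The clip counts add up to the stated 1,023. The session durations sum to 5,226 seconds, 5 seconds short of the stated total of 5,231 seconds. That gap would be consistent with per-session durations having been rounded to whole seconds, though this card does not say how the figures were derived.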
Each session folder contains:
- MP3 audio clips
- One per-session sentence-to-audio mapping file (mapping.tsv) with 4 columns:
  - #audio_filename: filename of the audio clip (MP3)
  - #key: unique hash identifier of the recording
  - #sentence: sentence text as read by the speaker, transcribed in AGLC orthography
  - #attempts: number of recording attempts before acceptance
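A mapping file with this layout can be read with Python's standard `csv` module. The sketch below is a hedged example rather than an official loader: it assumes the file is UTF-8, tab-separated, and starts with a header row, and it drops any leading `#` from column names because this card does not make clear whether the `#` is part of the name. The function name `read_mapping` is illustrative.

```python
import csv
import io

def read_mapping(lines):
    """Yield one dict per data row of a tab-separated mapping file.

    Assumes the first row is a header; a leading '#' on column names is dropped.
    """
    # QUOTE_NONE keeps apostrophes and quotation marks in sentences untouched.
    reader = csv.reader(lines, delimiter="\t", quoting=csv.QUOTE_NONE)
    header = [name.lstrip("#").strip() for name in next(reader)]
    for fields in reader:
        if fields:  # skip blank lines
            yield dict(zip(header, fields))

# In-memory stand-in for a mapping.tsv file. The filename and sentence come
# from the sample table in this card; the key and attempts values are made up.
SAMPLE = (
    "audio_filename\tkey\tsentence\tattempts\n"
    "b63f3b4db8951b7fcf0aff4d23544500.mp3\texample-key\t"
    "Ihomakandi etomba ya Bapuku yɔ́ yɛ́hɛ́pi mongolo.\t1\n"
)

rows = list(read_mapping(io.StringIO(SAMPLE)))
print(rows[0]["audio_filename"], "->", rows[0]["sentence"])
```

For a file on disk, the same function can be given a handle opened with `open(path, encoding="utf-8", newline="")`, where `path` points to the `mapping.tsv` inside a session folder such as `tts_dataset_bnm_01`.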
| audio file | sentence (Batanga, AGLC) |
|---|---|
| b63f3b4db8951b7fcf0aff4d23544500.mp3 | Ihomakandi etomba ya Bapuku yɔ́ yɛ́hɛ́pi mongolo. |
| 22d53bbeef1756720d99012c25492c20.mp3 | Indi na elombɛ yámɛ évahasɛ ilangwana nywɛ. |
| 4281aea830368defd4d8b38032fdb21a.mp3 | Haka kahá iyɛnɛ j'ɔvɛ dongwango na matɛdɛ ma ebota. |
| 84657740b10b62da39f550bf242fe4bb.mp3 | Mutowa mwawu múlanidi mútamwindi iwamidɛ. |
| 866415a087bdd84eca233371db579ac5.mp3 | Yongowa ehúnjandi ó iwa ó múnja epɔhɔ ó nʼtindi. |