License: NOODL-1.0
Steward: Institute of African Digital Humanities
Dataset ID: cmq9hqu3k02jbl2074u0yvk2i
Task: ASR
Release Date: 6/11/2026
Format: MP3, TSV
Size: 42.11 MB
Bulu-ASR-Dataset is a scripted speech dataset dedicated to the documentation and technological development of Bulu (ISO 639-3: bum), a Narrow Bantu language spoken primarily in the South and Centre Regions of Cameroon. It was compiled at the École Normale Supérieure de Yaoundé in 2026 and comprises 819 high-quality MP3 audio recordings of Bulu sentences read by 8 native speakers across 9 recording sessions, together with per-session sentence-to-audio mapping files that pair each sentence with its audio clip. Sentences were drawn from a scripted speech prompt list and read by each speaker in a controlled environment.
The primary added value of this dataset lies in its demographic composition: the majority of contributing speakers are female, a group that is significantly underrepresented in the existing Common Voice Scripted Speech 25.0 – Bulu dataset available on the Mozilla Data Collective platform. By providing a substantial body of high-quality female Bulu speech, this dataset directly addresses the speaker gender imbalance in available Bulu speech resources and supports the development and evaluation of speech technology models that are more robust across speaker genders.
The dataset follows the orthography established by the American Presbyterian Mission (Mission Protestante Américaine, MPA), the historically grounded and community-recognised writing standard for Bulu. This orthography, developed by MPA missionaries and Bulu-speaking collaborators from the late nineteenth century onwards, was codified through the Bulu Bible translation, grammar descriptions, and literacy materials that have shaped Bulu literacy for over a century.
The dataset is designed to complement the existing Common Voice Scripted Speech resource for Bulu rather than to replace it: it extends the total amount of available Bulu speech data while improving demographic coverage and orthographic fidelity.
The parallel availability of MPA-transcribed text and aligned speech makes the dataset suitable for a wide range of applications, including automatic speech recognition (ASR), text-to-speech (TTS), forced alignment, pronunciation modelling and language learning tools.
Licensing
Nwulite Obodo Open Data Licence 1.0 (NOODL-1.0)
https://licensingafricandatasets.com/nwulite-obodo-license
Restrictions/Special Constraints
By downloading this dataset, you agree:
- To use it for research and scientific use only
- That you will not re-host or re-share this dataset
Forbidden Usage
You agree not to use the data for: determining the identity of any speaker in the dataset; attempting to clone any voice or train models that imitate any speaker in this dataset; Generative AI; reproduction; duplication; modification; augmentation; copying; distribution; transmission; display; sale; transfer; publication or creation of derivative works without the explicit permission of the legal owner of the dataset.
Intended Use
(a) Speech-related tasks:
- Automatic speech recognition (ASR): Audio–text alignment enables the evaluation of speech recognition models for Bulu. Sentences are transcribed in the MPA orthographic standard with systematic diacritic notation for phonemically contrastive sounds (including the vowel distinctions o/ô, e/é, and the laryngeal consonant), which distinguishes this dataset from the Common Voice Scripted Speech 25.0 – Bulu resource and makes it particularly suited for building and evaluating gender-balanced, phonemically precise ASR models.
- Text-to-speech (TTS): The dataset contains clean sentence–audio pairs from multiple speakers, the majority of whom are female, and can be used to evaluate or fine-tune speech synthesis models for Bulu, particularly for female voice profiles. The MPA orthographic standard, including the vowel diacritics (ô, é) and the laryngeal consonant ('), should be taken into account when designing TTS front-end text normalisation and grapheme-to-phoneme components.
- Speech–text alignment / forced alignment benchmarking: Fine-grained audio–text pairing provides ground truth for evaluating phoneme- or word-level aligners adapted to tonal Bantu languages.
(b) Linguistic and lexicographic tasks:
- Phonological analysis: The systematic diacritic notation in MPA orthography makes the dataset suitable for studying the phonemic vowel contrasts (o/ô, e/é) and the laryngeal consonant in Bulu, as well as the distribution of the palatal nasal (ñ/ň) and other consonant phonemes. The dataset may also support research into the interaction between phonemic vowel quality and prosodic structure in Bulu.
- Orthographic standardisation and normalisation: The dataset can serve as a reference corpus for evaluating and training text normalisation and grapheme-to-phoneme (G2P) models aligned with the MPA orthographic standard, which is the historically dominant literacy norm for Bulu. The need to correctly handle precomposed diacritic letters (ô, é) and the laryngeal consonant (') makes this dataset a valuable testbed for G2P systems targeting Bulu.
- Language documentation: The dataset contributes to the digital documentation of Bulu scripted speech in MPA orthography, extending the existing Common Voice resource with orthographically principled transcriptions and improved demographic coverage.
- Gender-balanced speech resource development: The predominantly female speaker composition makes this dataset a valuable resource for research into gender equity in speech technology, and for training and evaluating models that must generalise across speaker genders for Bulu.
Bulu is a Narrow Bantu language belonging to the Beti-Fang group of the Benue-Congo branch. It is spoken primarily in the South Region of Cameroon, with significant speech communities in the Centre Region. Bulu is closely related to Ewondo, Fang, Ntumu and Eton within the Beti-Fang macro-language group. Ethnologue estimates the number of speakers at approximately 800,000, including first- and second-language users. Despite its relatively large speaker base, Bulu remains significantly underrepresented in language technology resources.
The glossonym 'Bulu' designates a set of closely related linguistic varieties whose speakers may or may not identify with the label, depending on geographical, social and pragmatic factors. In the framework of the Atlas Linguistique du Cameroun (ALCAM) project, Bulu is listed as one of the major micro-languages of the Beti-Fang macro-language, alongside Fang, Ewondo, Ntumu and Eton. Sub-varieties such as Yelinda and Bane are considered varieties of Bulu in standard classifications, though this classification is not always accepted by all their speakers.
The present dataset represents speakers of the Bulu variety as spoken in the Yaoundé area (Centre Region), recruited at the École Normale Supérieure de Yaoundé.
The writing system used for the transcription of Bulu in this dataset is the orthographic standard of the American Presbyterian Mission (Mission Protestante Américaine, MPA). This orthography was developed from the late nineteenth century onwards by MPA missionaries working in close collaboration with Bulu-speaking communities in the South Region of Cameroon, and was progressively consolidated through a series of foundational works: grammars, dictionaries, and above all the full translation of the Bible into Bulu, whose various editions spanning from the early twentieth century to the present constitute the most widely circulated corpus of written Bulu. The MPA orthographic standard has thus been the dominant and community-recognised literacy norm for Bulu for over a century and remains in active use in religious, educational and everyday written communication.
The MPA orthography employs diacritics on vowels to represent phonemically distinct sounds, in a manner analogous to French orthographic conventions. Diacritics signal differences in vowel quality rather than suprasegmental properties such as tone. The principal vowel contrasts attested in the dataset are:
o vs. ô: two phonemically distinct o-quality vowels. ô (LATIN SMALL LETTER O WITH CIRCUMFLEX) is the most frequent diacriticized character in the dataset (×821) and marks a vowel phoneme systematically distinct from plain o (e.g. ôse, ôkon, lôn, jôm, yôp, mongô).
e vs. é: two phonemically distinct e-quality vowels. é (LATIN SMALL LETTER E WITH ACUTE) is the second most frequent diacriticized character (×526) and marks a vowel phoneme systematically distinct from plain e (e.g. ésa, éte, jôé, meté).
The remaining base vowels are a, i, u. Long vowels are represented by vowel doubling (e.g. aa, ee, oo).
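Character counts such as those quoted for ô and é depend on how the text is encoded: a decomposed sequence (o followed by U+0302 COMBINING CIRCUMFLEX ACCENT) is not the same code point as precomposed ô. The following is a minimal sketch of one way to count the two marked vowels, assuming the transcriptions may mix composed and decomposed forms, which the source does not confirm. The function name `count_marked_vowels` is hypothetical, and the demo uses only the four sample sentences listed in this card's sample table, not the full dataset.

```python
import unicodedata
from collections import Counter

# Precomposed code points named in the text: ô (U+00F4) and é (U+00E9).
MARKED_VOWELS = ("\u00f4", "\u00e9")


def count_marked_vowels(sentences):
    """Count ô and é across sentences after NFC normalisation and lowercasing."""
    counts = Counter({vowel: 0 for vowel in MARKED_VOWELS})
    for sentence in sentences:
        # NFC folds 'o' + U+0302 into the single code point U+00F4.
        text = unicodedata.normalize("NFC", sentence).lower()
        counts.update(ch for ch in text if ch in MARKED_VOWELS)
    return counts


# The four sample sentences shown in this card's sample table.
samples = [
    "si é mbe",
    "mot ane jôé na Ondo Mba",
    "ndô'ôtô ô nji tu'a wô'ô jôm é ne meté",
    "à nga kalan je yôp à ve'ele wulu je yôp",
]

counts = count_marked_vowels(samples)
print(counts["\u00f4"], counts["\u00e9"])
```

Applied to the `sentence` column of all mapping files, the same function could be used to check the reported frequencies, although the result will depend on whether uppercase letters and decomposed sequences were included in the original count.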
The consonant inventory reflected in the dataset includes simple, prenasalized and digraph consonants, as well as two specially marked consonants:
b, d, dz, f, g, h, j, k, l, m, mb, mv, n, nd, ng, nk, nz, ny, p, s, t, ts, v, w, y, z
Special consonants:
' (apostrophe / RIGHT SINGLE QUOTATION MARK, U+2019): marks the laryngeal consonant, phonemically contrastive in Bulu (e.g. ndô'ôtô, wô'ô, ve'ele, tu'a, fô'ôsan).
ñ / ň: marks the palatal nasal consonant /ɲ/ (e.g. éyoñ, minlaň, nyoňe). Both graphemes (LATIN SMALL LETTER N WITH TILDE and LATIN SMALL LETTER N WITH CARON) occur in the dataset and represent the same sound; this orthographic variation reflects the source texts from which prompt sentences were drawn.
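Because ñ and ň represent the same sound, and because the laryngeal consonant may be typed as more than one apostrophe-like character, comparing transcriptions (for example when scoring ASR output) can benefit from folding these variants together first. The sketch below is one possible approach, not the dataset's own procedure: the function name `normalise_mpa`, the choice of ñ as the canonical nasal, and the choice of U+2019 as the canonical laryngeal mark are all assumptions made for illustration.

```python
import unicodedata

# Apostrophe-like characters folded together: ASCII apostrophe and
# RIGHT SINGLE QUOTATION MARK. Extend this set if other variants occur.
APOSTROPHES = ("\u0027", "\u2019")


def normalise_mpa(text, nasal="\u00f1", laryngeal="\u2019"):
    """Fold ñ/ň and apostrophe variants to one grapheme each (illustrative choice)."""
    text = unicodedata.normalize("NFC", text)
    table = {
        ord("\u00f1"): nasal,          # ñ
        ord("\u0148"): nasal,          # ň
        ord("\u00d1"): nasal.upper(),  # Ñ
        ord("\u0147"): nasal.upper(),  # Ň
    }
    for mark in APOSTROPHES:
        table[ord(mark)] = laryngeal
    return text.translate(table)


print(normalise_mpa("minlaň") == normalise_mpa("minlañ"))
print(normalise_mpa("tu'a") == normalise_mpa("tu\u2019a"))
```

This folding is only safe where every apostrophe in a sentence marks the laryngeal consonant. If a prompt also uses apostrophes as quotation marks or for elision, those cases would need to be handled separately.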
The dataset was compiled from scripted speech prompt lists read by native speakers of Bulu in recording sessions held at the École Normale Supérieure de Yaoundé in 2026, in the framework of the Mozilla Data Collective project. Sentences were selected to provide broad phonological coverage of Bulu and were transcribed in accordance with the MPA orthographic standard, with systematic diacritic notation for phonemically contrastive vowels and consonants.
The dataset represents scripted speech in Bulu, covering a broad range of everyday sentence types drawn from a general-purpose ASR/TTS prompt list. All utterances are scripted rather than spontaneous.
The dataset comprises:
819 MP3 audio clips read by 8 native speakers of Bulu, with a total duration of 3,468 seconds (00:57:48), distributed across 9 recording sessions:
Session bum_01: 100 clips (5m 46s)
Session bum_02: 100 clips (8m 44s)
Session bum_02-1: 100 clips (6m 08s)
Session bum_03: 100 clips (8m 35s)
Session bum_04: 100 clips (4m 57s)
Session bum_05: 100 clips (6m 55s)
Session bum_06: 100 clips (7m 37s)
Session bum_07: 100 clips (8m 05s)
Session bum_08: 19 clips (0m 56s)
Nine per-session sentence-to-audio mapping files (mapping.tsv), each with 4 columns.
#audio_filename: filename of the audio clip (MP3)
#key: unique hash identifier of the recording
#sentence: sentence text as read by the speaker, transcribed in MPA orthography with diacritic notation for phonemically contrastive sounds
#attempts: number of recording attempts before acceptance
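The layout described above can be sanity-checked with a short script. The sketch below assumes each mapping.tsv is tab-separated with a header row naming the four columns; it strips a leading `#` from column names in case the files carry it, which the source does not make clear. The helper names (`to_seconds`, `read_mapping`) are hypothetical, and the `key` and `attempts` values in the demo row are placeholders rather than real data.

```python
import csv
import io

# Clip counts and durations copied from the session list above.
SESSIONS = {
    "bum_01": (100, "5m 46s"),
    "bum_02": (100, "8m 44s"),
    "bum_02-1": (100, "6m 08s"),
    "bum_03": (100, "8m 35s"),
    "bum_04": (100, "4m 57s"),
    "bum_05": (100, "6m 55s"),
    "bum_06": (100, "7m 37s"),
    "bum_07": (100, "8m 05s"),
    "bum_08": (19, "0m 56s"),
}

COLUMNS = ("audio_filename", "key", "sentence", "attempts")


def to_seconds(duration):
    """Convert a string such as '5m 46s' to whole seconds."""
    minutes, seconds = duration.split()
    return int(minutes.rstrip("m")) * 60 + int(seconds.rstrip("s"))


def read_mapping(handle):
    """Read one mapping.tsv from an open text handle into a list of dicts."""
    # QUOTE_NONE keeps apostrophes and quotes in sentences untouched.
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    header = [name.lstrip("#").strip() for name in next(reader)]
    missing = set(COLUMNS) - set(header)
    if missing:
        raise ValueError(f"missing columns: {sorted(missing)}")
    return [dict(zip(header, row)) for row in reader if row]


total_clips = sum(clips for clips, _ in SESSIONS.values())
total_seconds = sum(to_seconds(d) for _, d in SESSIONS.values())
print(total_clips, total_seconds)

# Demo row: filename and sentence come from the sample table;
# the key and attempts values are placeholders.
demo = (
    "audio_filename\tkey\tsentence\tattempts\n"
    "dab6fa24008dd83e7f212d99d3f4eb35.mp3\tplaceholder-key\tsi é mbe\t1\n"
)
rows = read_mapping(io.StringIO(demo))
print(rows[0]["sentence"])
```

The clip counts sum to 819, matching the stated total. The per-session durations sum to 3,463 seconds, 5 seconds short of the stated 3,468; this is consistent with the per-session figures having been rounded to whole seconds, though the source does not say so. With real files, `read_mapping` would be called on each session's mapping.tsv opened with `encoding="utf-8"`, and the row count compared against `SESSIONS`.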
| audio file | sentence (Bulu, MPA orthography) |
|---|---|
| dab6fa24008dd83e7f212d99d3f4eb35.mp3 | si é mbe |
| 15c5db6891149271f924ed57c04ce097.mp3 | mot ane jôé na Ondo Mba |
| 53610d4a2b58ee24a154e380533a7f50.mp3 | ndô'ôtô ô nji tu'a wô'ô jôm é ne meté |
| 73df4abb3241efec595196835c91f5b7.mp3 | à nga kalan je yôp à ve'ele wulu je yôp |