Bulu-ASR-Dataset

Description

Bulu-ASR-Dataset is a scripted speech dataset dedicated to the documentation and technological development of Bulu (ISO 639-3: bum), a Narrow Bantu language spoken primarily in the South and Centre Regions of Cameroon. It was compiled at the École Normale Supérieure de Yaoundé in 2026 and comprises 819 high-quality MP3 audio recordings of Bulu sentences read by 8 native speakers across 9 recording sessions, together with per-session sentence-to-audio mapping files that align each sentence with its audio clip. Sentences were drawn from a scripted speech prompt list and read by each speaker in a controlled environment.

The primary added value of this dataset lies in its demographic composition: the majority of contributing speakers are female, a group that is significantly underrepresented in the existing Common Voice Scripted Speech 25.0 – Bulu dataset available on the Mozilla Data Collective platform. By providing a substantial body of high-quality female Bulu speech, the dataset directly addresses the speaker gender imbalance in available Bulu speech resources and supports the development and evaluation of speech technology models that are more robust across speaker genders.

The dataset follows the orthography established by the American Presbyterian Mission (Mission Protestante Américaine, MPA), the historically grounded and community-recognised writing standard for Bulu. This orthography, developed by MPA missionaries and Bulu-speaking collaborators from the late nineteenth century onwards, was codified through the Bulu Bible translation, grammar descriptions, and literacy materials that have shaped Bulu literacy for over a century.

Methodologically, the dataset is designed to complement the existing Common Voice Scripted Speech resource for Bulu rather than to replace it: it extends the total amount of available Bulu speech data while improving demographic coverage and orthographic fidelity. The parallel availability of MPA-transcribed text and aligned speech makes the dataset suitable for a wide range of applications, including automatic speech recognition (ASR), text-to-speech (TTS), forced alignment, pronunciation modelling and language learning tools.

Language

Bulu is a Narrow Bantu language belonging to the Beti-Fang group of the Benue-Congo branch. It is spoken primarily in the South Region of Cameroon, with significant speech communities in the Centre Region. Within the Beti-Fang macro-language group, Bulu is closely related to Ewondo, Fang, Ntumu and Eton. Ethnologue estimates the number of speakers at approximately 800,000, including first- and second-language users. Despite this relatively large speaker base, Bulu remains significantly underrepresented in language technology resources.

Variants

The glossonym 'Bulu' designates a set of closely related linguistic varieties whose speakers may or may not identify with the label, depending on geographical, social and pragmatic factors. In the framework of the Atlas Linguistique du Cameroun (ALCAM) project, Bulu is listed as one of the major micro-languages of the Beti-Fang macro-language, alongside Fang, Ewondo, Ntumu and Eton. Standard classifications treat sub-varieties such as Yelinda and Bane as varieties of Bulu, though not all of their speakers accept this classification.

The present dataset represents speakers of the Bulu variety as spoken in the Yaoundé area (Centre Region), recruited at the École Normale Supérieure de Yaoundé.

Writing System

The writing system used for the transcription of Bulu in this dataset is the orthographic standard of the American Presbyterian Mission (Mission Protestante Américaine, MPA). This orthography was developed from the late nineteenth century onwards by MPA missionaries working in close collaboration with Bulu-speaking communities in the South Region of Cameroon, and was progressively consolidated through a series of foundational works: grammars, dictionaries, and above all the full translation of the Bible into Bulu, whose various editions spanning from the early twentieth century to the present constitute the most widely circulated corpus of written Bulu. The MPA orthographic standard has thus been the dominant and community-recognised literacy norm for Bulu for over a century and remains in active use in religious, educational and everyday written communication.

1. Vowels

The MPA orthography employs diacritics on vowels to represent phonemically distinct sounds, in a manner analogous to French orthographic conventions. Diacritics signal differences in vowel quality rather than suprasegmental properties such as tone. The principal vowel contrasts attested in the dataset are:

o vs. ô: two phonemically distinct o-quality vowels. ô (LATIN SMALL LETTER O WITH CIRCUMFLEX) is the most frequent diacriticized character in the dataset (×821) and marks a vowel phoneme systematically distinct from plain o (e.g. ôse, ôkon, lôn, jôm, yôp, mongô).
e vs. é: two phonemically distinct e-quality vowels. é (LATIN SMALL LETTER E WITH ACUTE) is the second most frequent diacriticized character (×526) and marks a vowel phoneme systematically distinct from plain e (e.g. ésa, éte, jôé, meté).

The remaining base vowels are a, i, u. Long vowels are represented by vowel doubling (e.g. aa, ee, oo).
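
As an illustration of how character frequencies such as those quoted above could be tallied, the following is a minimal sketch. The helper name `count_marked_vowels` is hypothetical and not part of the dataset; the sketch assumes sentences are available as Python strings and applies NFC normalization so that a base letter followed by a combining accent is counted like the precomposed character.

```python
import unicodedata
from collections import Counter


def count_marked_vowels(sentences, targets=("ô", "é")):
    """Count occurrences of the target characters, ignoring case.

    Hypothetical helper for illustration; not shipped with the dataset.
    """
    counts = Counter()
    for sentence in sentences:
        # NFC composes "o" + U+0302 into the single code point "ô",
        # so both encodings of the same letter are counted together.
        text = unicodedata.normalize("NFC", sentence).lower()
        for ch in text:
            if ch in targets:
                counts[ch] += 1
    return counts


# Three sentences taken from the sample at the end of this card.
sample = [
    "si é mbe",
    "mot ane jôé na Ondo Mba",
    "ndô'ôtô ô nji tu'a wô'ô jôm é ne meté",
]
sample_counts = count_marked_vowels(sample)
print(sample_counts["ô"], sample_counts["é"])
```

Run over the sentence column of all mapping files, the same function should give figures comparable to the counts reported above, provided the text is encoded consistently.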

2. Consonants

The consonant inventory reflected in the dataset includes simple, prenasalized and digraph consonants, as well as two specially marked consonants:

b, d, dz, f, g, h, j, k, l, m, mb, mv, n, nd, ng, nk, nz, ny, p, s, t, ts, v, w, y, z

Special consonants:

' (apostrophe / RIGHT SINGLE QUOTATION MARK, U+2019): marks the laryngeal consonant, phonemically contrastive in Bulu (e.g. ndô'ôtô, wô'ô, ve'ele, tu'a, fô'ôsan).
ñ / ň: marks the velar nasal consonant /ŋ/ (e.g. éyoñ, minlaň, nyoňe). Both graphemes (LATIN SMALL LETTER N WITH TILDE and LATIN SMALL LETTER N WITH CARON) occur in the dataset and represent the same sound; this orthographic variation reflects the source texts from which prompt sentences were drawn.
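
Because two graphemes are used for the same nasal and the apostrophe may appear in more than one encoding, downstream users may want to normalize transcripts before training or scoring a model. The sketch below shows one possible approach. The choice of ñ and U+2019 as canonical forms is an assumption made for illustration, and `normalize_bulu` is a hypothetical helper, not part of the dataset.

```python
import unicodedata

# Assumed canonical forms, chosen for illustration only.
# The dataset itself keeps both nasal graphemes as found in the source texts.
_CANONICAL = str.maketrans({
    "\u0148": "\u00f1",  # ň (n with caron) -> ñ (n with tilde)
    "\u0147": "\u00d1",  # Ň -> Ñ
    "\u0027": "\u2019",  # ASCII apostrophe -> right single quotation mark
})


def normalize_bulu(text):
    """Return text in NFC form with one grapheme per sound."""
    # NFC first, so "n" + combining caron becomes the single code point ň
    # before the translation table is applied.
    return unicodedata.normalize("NFC", text).translate(_CANONICAL)


print(normalize_bulu("minlaň"))
print(normalize_bulu("tu'a") == "tu\u2019a")
```

Note that mapping every ASCII apostrophe to U+2019 would also affect apostrophes used as punctuation, so the table should be adapted if the prompts contain quoted material.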

Source

The dataset was compiled from scripted speech prompt lists read by native speakers of Bulu in recording sessions held at the École Normale Supérieure de Yaoundé in 2026, in the framework of the Mozilla Data Collective project. Sentences were selected to provide broad phonological coverage of Bulu and were transcribed in accordance with the MPA orthographic standard, with systematic diacritic notation for phonemically contrastive vowels and consonants.

Domain

The dataset represents scripted speech in Bulu, covering a broad range of everyday sentence types drawn from a general-purpose ASR/TTS prompt list. All utterances are scripted rather than spontaneous.

Size

Total audio duration: 3,468 seconds (00:57:48), distributed across 819 MP3 audio clips in 9 recording sessions contributed by 8 native speakers of Bulu.

Structure

The dataset comprises:

- 819 MP3 audio clips read by 8 native speakers of Bulu, with a total duration of 3,468 seconds (00:57:48), distributed across 9 recording sessions:
  - Session bum_01: 100 clips (5m 46s)
  - Session bum_02: 100 clips (8m 44s)
  - Session bum_02-1: 100 clips (6m 08s)
  - Session bum_03: 100 clips (8m 35s)
  - Session bum_04: 100 clips (4m 57s)
  - Session bum_05: 100 clips (6m 55s)
  - Session bum_06: 100 clips (7m 37s)
  - Session bum_07: 100 clips (8m 05s)
  - Session bum_08: 19 clips (0m 56s)
- Nine per-session sentence-to-audio mapping files (mapping.tsv), each with 4 columns.
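
The per-session figures listed above can be cross-checked with a few lines of arithmetic. In this sketch the dictionary restates the list and `to_seconds` is a hypothetical helper. The listed durations sum to 3,463 seconds, a few seconds short of the stated total of 3,468. This gap is plausibly an effect of per-session rounding, though that explanation is an assumption.

```python
# Per-session clip counts and durations, copied from the list above.
sessions = {
    "bum_01": (100, "5m 46s"),
    "bum_02": (100, "8m 44s"),
    "bum_02-1": (100, "6m 08s"),
    "bum_03": (100, "8m 35s"),
    "bum_04": (100, "4m 57s"),
    "bum_05": (100, "6m 55s"),
    "bum_06": (100, "7m 37s"),
    "bum_07": (100, "8m 05s"),
    "bum_08": (19, "0m 56s"),
}


def to_seconds(duration):
    """Convert a string such as '6m 08s' to a number of seconds."""
    minutes, seconds = duration.split()
    return int(minutes.rstrip("m")) * 60 + int(seconds.rstrip("s"))


total_clips = sum(clips for clips, _ in sessions.values())
total_seconds = sum(to_seconds(d) for _, d in sessions.values())

print(total_clips)                       # 819
print(divmod(total_seconds, 60))         # (57, 43), i.e. 3463 seconds
```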

Description of columns (mapping.tsv)

#audio_filename: filename of the audio clip (MP3)
#key: unique hash identifier of the recording
#sentence: sentence text as read by the speaker, transcribed in MPA orthography with diacritic notation for phonemically contrastive sounds
#attempts: number of recording attempts before acceptance
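
A mapping file with these four columns could be read as follows. This is a minimal sketch under stated assumptions: that each mapping.tsv is UTF-8, tab-separated, and begins with a header row carrying the column names above (with or without the leading `#`). The function name `read_mapping` is hypothetical, and the `key` and `attempts` values in the demo row are placeholders, not real data.

```python
import csv
import io


def read_mapping(handle):
    """Parse an open mapping.tsv handle into a list of dicts.

    Assumes a header row; a leading '#' on column names is dropped.
    """
    # QUOTE_NONE keeps quotation marks in sentences as literal text.
    reader = csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    rows = []
    for raw in reader:
        row = {name.lstrip("#"): value for name, value in raw.items()}
        row["attempts"] = int(row["attempts"])
        rows.append(row)
    return rows


# In-memory stand-in for a real file; key and attempts are placeholders.
demo = io.StringIO(
    "#audio_filename\t#key\t#sentence\t#attempts\n"
    "dab6fa24008dd83e7f212d99d3f4eb35.mp3\tplaceholder-key\tsi é mbe\t1\n"
)
demo_rows = read_mapping(demo)
print(demo_rows[0]["audio_filename"], demo_rows[0]["sentence"])
```

For a file on disk, the handle would typically be opened with `open(path, encoding="utf-8", newline="")`, where `path` points to one session's mapping.tsv.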

Sample

audio file	sentence (Bulu, MPA orthography)
dab6fa24008dd83e7f212d99d3f4eb35.mp3	si é mbe
15c5db6891149271f924ed57c04ce097.mp3	mot ane jôé na Ondo Mba
53610d4a2b58ee24a154e380533a7f50.mp3	ndô'ôtô ô nji tu'a wô'ô jôm é ne meté
73df4abb3241efec595196835c91f5b7.mp3	à nga kalan je yôp à ve'ele wulu je yôp