License: NOODL-1.0
Steward: Institute of African Digital Humanities
Dataset ID: cmq9hwfbr02gfmk07m1jgvwje
Task: ASR
Release Date: 6/11/2026
Format: TSV, MP3
Size: 76.89 MB
Basaa-ASR-Dataset is a curated speech dataset dedicated to the documentation and technological development of the Basaa language (ISO 639-3: bas), spoken in the Centre, Littoral, and South regions of Cameroon. It was collected as part of a speech data initiative at the École Normale Supérieure de Yaoundé (ENS-Yaoundé) in collaboration with the Mozilla Data Collective (MDC). It complements the two existing MDC datasets for Basaa, "Common Voice Scripted Speech 25.0 – Basaa" (https://mozilladatacollective.com/datasets/cmn2bk0jn01f2mm07dv6v1jbt) and "Common Voice Spontaneous Speech 3.0 – Basaa" (https://mozilladatacollective.com/datasets/cmn1pow3f00vtmm07dre8danl); the three datasets can be combined depending on the task at hand.
The added value of Basaa-ASR-Dataset is twofold. First, the transcription orthography used throughout the dataset is the General Alphabet of Cameroon's Languages (French acronym: AGLC), a standardised, phonologically motivated writing system that was insufficiently represented in previous Basaa speech datasets. Second, the dataset explicitly captures a broader range of Basaa dialectal varieties, including Basaa-ba-Yabassi and Babimbi, which were not sufficiently represented in earlier collections.
The dataset comprises 2,498 MP3 audio recordings distributed across 25 recording sessions. Fifteen sessions (identified as asr-tts_dataset_bas_21 through bas_38) were conducted with 15 distinct speakers, each reading a set of 100 sentences in Basaa. The remaining 10 sessions (bassa_tts_dataset_01 through 10) constitute a dedicated TTS sub-corpus. All recordings were made using the MDC recording platform, which captures speaker metadata, including the number of attempts per sentence.
A total of 2,500 unique sentences were used across the dataset: 1,500 in the ASR/TTS sessions and 1,000 in the TTS sessions, with no overlap between the two sets. The total recorded audio amounts to 6,887 seconds (01:54:47).
Licensing
Nwulite Obodo Open Data Licence 1.0 (NOODL-1.0)
https://licensingafricandatasets.com/nwulite-obodo-license
Restrictions/Special Constraints
The sentence transcriptions in this dataset are provided in the AGLC orthography. Researchers using the dataset for ASR or language modelling tasks should ensure that their text processing pipelines are compatible with AGLC-specific characters, in particular the dotless **ı** (U+0131), the implosive **ɓ** (U+0253), the mid vowels **ɛ** (U+025B) and **ɔ** (U+0254), and the full range of combining tonal diacritics. Users are also encouraged to consult the two complementary Common Voice Basaa datasets on the MDC platform, which may provide additional acoustic and transcription coverage.
By downloading this dataset, you agree:
- To use it for research and scientific use only
- That you will not re-host or re-share this dataset
Forbidden Usage
You agree not to use the data for: determining the identity of speakers in the dataset; attempting to clone voices or train models that imitate speakers in this dataset; Generative AI; reproduction; duplication; modification; augmentation; copying; distribution; transmission; display; sale; transfer; publication or creation of derivative works without the explicit permission of the legal owner of the dataset.
Intended Use
(a) Speech-related tasks:
- Automatic speech recognition (ASR): Audio–text alignment enables the training and evaluation of speech recognition models for Basaa. Sentences are transcribed in the AGLC orthography, which provides a consistent phonological representation and is preferable to IPA for downstream ASR applications. This dataset is the first MDC Basaa speech resource to use AGLC as its transcription standard. It can be used alongside the two Common Voice Basaa datasets, which use different orthographic conventions, for more complete coverage of the language.
- Text-to-speech (TTS): The dataset contains clean sentence–audio pairs and can be used to train or evaluate speech synthesis models for Basaa. The dedicated TTS sub-corpus (10 sessions, 1,000 clips) provides a focused resource for TTS development. Users should ensure that their frontend text normalisation pipelines handle AGLC characters correctly.
- Speech–text alignment / forced alignment benchmarking: Fine-grained audio–sentence pairing provides ground truth for evaluating phoneme- or word-level aligners adapted to tonal and phonologically complex Bantu languages.
(b) Translation and multilingual tasks:
- Speech translation (speech-to-text)
(c) Linguistic and lexicographic tasks:
- Dialectal and sociolinguistic studies: The explicit inclusion of multiple Basaa varieties, in particular Basaa-ba-Yabassi and Babimbi, makes this dataset useful for comparative dialectological work and the study of intra-Basaa phonological and lexical variation.
- Morphological analysis and lexicon development: The AGLC-transcribed sentences support the development of morphological analysers, pronunciation lexicons, and computational grammars for Basaa.
- Language documentation and revitalisation: The dataset contributes to the digital documentation of spoken Basaa across its dialectal range, in a format directly usable for language technology applications and language education tools.
Basaa is a narrow Bantu language (ISO 639-3: bas) spoken across a geographical area spanning three administrative regions in Cameroon: the Centre, Littoral, and South regions. It is estimated that there are currently around 600,000–700,000 speakers, including speakers of different varieties as well as diasporic populations who identify as Basaa speakers.
The vitality of the Basaa language is broadly stable (Ethnologue online). However, intergenerational transmission is increasingly threatened among parents aged 50 and under, particularly in urban areas. Although Basaa is taught in some schools, this has not significantly impacted language vitality, mainly due to a reliance on rule-based and descriptivist teaching methods.
The glossonym 'Basaa' is a generic term encompassing a range of varieties, the speakers of which may identify with the 'Basaa' label to varying degrees, depending on a complex set of geographical, social, political, situational, and pragmatic factors. Some of the most commonly acknowledged varieties include:
Mbene
Bikok
Babimbi
Basaa ba Omeng
Basaa ba Yabassi
Basaa ba Duala
Ndog-Bikim
Other varieties, such as Ndonga, Mbaa (also known as Mbay-Bati), and Hijuk, may also be classified as Basaa, though not all speakers agree on this classification.
The present dataset explicitly includes speakers of Basaa-ba-Yabassi and Babimbi, varieties that were insufficiently represented in the earlier Common Voice Basaa datasets. It therefore extends dialectal coverage across the Basaa-speaking area.
The writing system used for all transcriptions in this dataset is the General Alphabet of Cameroon's Languages (French acronym: AGLC), a standardised orthography developed for Cameroonian languages under the auspices of PROPELCA (Operational Research Project for Language Teaching in Cameroon). The AGLC is the orthographic standard closest to the phonological system of Basaa and provides a more consistent and reproducible representation of its consonantal, vocalic, and tonal distinctions than earlier missionary-based orthographies.
The vowel system of Basaa as represented in the AGLC comprises seven oral vowel qualities:
a, e, ɛ, ı (dotless i), ɔ, o, u
Each vowel may carry one of the tonal diacritics described below. The dotless letter ı (U+0131) is used by the AGLC to distinguish a specific high front unrounded vowel from the standard Latin i.
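One way to check that a text pipeline preserves these characters is to normalise to a single Unicode form and count the AGLC-specific letters and combining marks that survive. The sketch below uses only Python's standard `unicodedata` module; the function names and label strings are illustrative assumptions and are not part of the dataset or the MDC tooling.

```python
import unicodedata

# Base letters singled out in this card as AGLC-specific.
AGLC_SPECIAL = {
    "\u0131": "dotless i",
    "\u0253": "implosive b",
    "\u025b": "open e",
    "\u0254": "open o",
}


def normalise_aglc(text: str) -> str:
    """Return the text in NFC so precomposed and decomposed forms compare equal."""
    return unicodedata.normalize("NFC", text)


def special_characters(text: str) -> dict[str, int]:
    """Count AGLC-specific base letters and combining marks in the text."""
    counts: dict[str, int] = {}
    # NFD exposes every diacritic as a separate combining code point.
    for ch in unicodedata.normalize("NFD", text):
        if ch in AGLC_SPECIAL:
            label = AGLC_SPECIAL[ch]
        elif unicodedata.combining(ch):
            label = "combining mark"
        else:
            continue
        counts[label] = counts.get(label, 0) + 1
    return counts


sample = "J\u0131\u030cb\u025b\u0300 li \u014bkop."
print(special_characters(sample))
```

Running the same count before and after a processing step (tokenisation, lower-casing, export) is a quick way to spot silently dropped characters. Many tone-marked AGLC letters, such as ɛ̀ or ı̌, have no precomposed code point, so they remain two code points even after NFC normalisation.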
The consonant inventory reflected in the dataset includes simple and complex consonants:
Simple consonants: p, b, ɓ, c, d, g, h, j, k, l, m, n, ŋ, s, t, w, y
Complex consonants: mb, nd, ŋg, ŋgw, ny
The implosive bilabial ɓ is a phonologically distinctive consonant in Basaa and is systematically represented in the AGLC orthography.
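For pronunciation lexicons or character-level models, it can help to treat the complex consonants as single units. The following is a minimal sketch of a greedy longest-match splitter built from the inventory above. It assumes that every occurrence of a listed letter sequence is a complex consonant, which may not hold across morpheme boundaries, and the function name `graphemes` is hypothetical.

```python
import unicodedata

# Complex consonants listed in the card, longest first so "ŋgw" wins over "ŋg".
COMPLEX = sorted(["mb", "nd", "ŋg", "ŋgw", "ny"], key=len, reverse=True)


def graphemes(word: str) -> list[str]:
    """Split a word into orthographic units, keeping tone marks with their base letter."""
    text = unicodedata.normalize("NFD", word.lower())
    units: list[str] = []
    i = 0
    while i < len(text):
        # Attach a combining mark to the unit that precedes it.
        if unicodedata.combining(text[i]) and units:
            units[-1] += text[i]
            i += 1
            continue
        # Take the longest complex consonant starting here, else a single letter.
        match = next((c for c in COMPLEX if text.startswith(c, i)), text[i])
        units.append(match)
        i += len(match)
    return units


print(graphemes("ŋgwa"))
```

Because matching runs on the decomposed form, a tone-bearing nasal followed by y (as in the sample word ǹyonos) is not merged into ny: the combining mark sits between the two letters.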
Basaa has a grammatically and lexically contrastive tonal system. Tones are marked directly on vowels in the AGLC orthography:
High tone (H): á, é, ɛ́, í, ɔ́, ó, ú
Low tone (L): à, è, ɛ̀, ı̀, ɔ̀, ò, ù
Falling tone (HL): â, ê, ɛ̂, ı̂, ɔ̂, ô, û
Rising tone (LH): ǎ, ě, ɛ̌, ı̌, ɔ̌, ǒ, ǔ
Mid / level tone (downstep/upstep): ā, ē, ɛ̄, ī, ɔ̄, ō, ū
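The tone marks listed above correspond to five Unicode combining characters, so tone information can be read off, or stripped, after canonical decomposition. The sketch below is one possible approach using Python's standard library. The short labels (H, L, HL, LH, M) and the function names are assumptions made for illustration, and unmarked vowels are simply skipped because the card does not state how they should be interpreted.

```python
import unicodedata

# Combining marks for the five tone categories listed above.
TONE_MARKS = {
    "\u0301": "H",   # acute accent
    "\u0300": "L",   # grave accent
    "\u0302": "HL",  # circumflex
    "\u030c": "LH",  # caron
    "\u0304": "M",   # macron
}


def tone_sequence(text: str) -> list[str]:
    """List the tone labels of all tone-marked letters, in reading order."""
    decomposed = unicodedata.normalize("NFD", text)
    return [TONE_MARKS[ch] for ch in decomposed if ch in TONE_MARKS]


def strip_tones(text: str) -> str:
    """Remove tone marks only, leaving base letters such as ɛ, ɔ, ɓ and ı intact."""
    decomposed = unicodedata.normalize("NFD", text)
    kept = "".join(ch for ch in decomposed if ch not in TONE_MARKS)
    return unicodedata.normalize("NFC", kept)


sentence = "Lini j\u0131\u0300b\u00e2 li nn\u025b\u0304."
print(tone_sequence(sentence))
print(strip_tones(sentence))
```

A toneless copy of the transcripts produced this way may be useful for comparing against resources that use other orthographic conventions, though whether tone should be kept for a given ASR or TTS model is an empirical question.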
The dataset was collected through the MDC speech recording platform as part of a data collection initiative conducted at the École Normale Supérieure de Yaoundé (ENS-Yaoundé), Cameroon, under the coordination of Emmanuel Ngue Um. Sentences were drawn from a corpus of general Basaa utterances compiled for ASR and TTS tasks, transcribed in the AGLC orthography by trained annotators familiar with the writing system. Speakers were recruited to represent a range of Basaa varieties, with particular effort to include speakers of Basaa-ba-Yabassi and Babimbi.
The dataset represents general spoken Basaa drawn from everyday communicative contexts: declarative statements, questions, imperatives, and conversational exchanges. Sentences span a broad range of lexical and grammatical domains, making the dataset suitable as a general-purpose speech resource for Basaa.
Total audio duration: 6,887 seconds (01:54:47), measured from 2,498 MP3 audio clips distributed across 25 recording sessions. Total uncompressed dataset size: approximately 65 MB.
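As a rough sense of scale, the stated totals imply an average clip length of under three seconds. The arithmetic below uses only the figures given in this card; per-clip durations are not stated, so the averages are indicative only.

```python
TOTAL_SECONDS = 6887   # total audio duration stated in the card
CLIP_COUNT = 2498      # number of MP3 clips stated in the card
SESSION_COUNT = 25     # number of recording sessions stated in the card

mean_clip_seconds = TOTAL_SECONDS / CLIP_COUNT
mean_session_minutes = TOTAL_SECONDS / SESSION_COUNT / 60

# prints: 2.76 s per clip, 4.6 min per session
print(f"{mean_clip_seconds:.2f} s per clip, {mean_session_minutes:.1f} min per session")
```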
The dataset comprises:
- 2,498 MP3 audio clips organised in 25 recording session folders:
  - 15 ASR/TTS session folders (asr-tts_dataset_bas_21 through bas_38), each with 100 clips read by a distinct speaker;
  - 10 TTS session folders (bassa_tts_dataset_01 through 10), each with 100 clips.
- 25 per-session sentence-to-audio mapping files (mapping.tsv), each with 4 columns.
A total of 2,500 unique sentences are covered: 1,500 in the ASR/TTS sessions and 1,000 in the TTS sessions, with no overlap between the two sub-corpora.
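Twenty-five sessions of 100 clips would give 2,500 clips, whereas the stated total is 2,498, so it may be worth checking the per-session counts after download. The following sketch assumes that the MP3 files sit directly inside each session folder under a single dataset root. That layout, and the function names, are assumptions rather than documented facts.

```python
from pathlib import Path


def clips_per_session(root: Path) -> dict[str, int]:
    """Count MP3 files in each session folder directly under the dataset root."""
    return {
        folder.name: sum(1 for _ in folder.glob("*.mp3"))
        for folder in sorted(root.iterdir())
        if folder.is_dir()
    }


def short_sessions(counts: dict[str, int], expected: int = 100) -> dict[str, int]:
    """Return the sessions whose clip count differs from the expected value."""
    return {name: n for name, n in counts.items() if n != expected}
```

Calling `short_sessions(clips_per_session(Path("path/to/dataset")))` (with a placeholder path) lists any session that does not contain exactly 100 clips.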
Each mapping.tsv file has the following columns:
#audio_filename: filename of the audio clip (original WebM key, corresponding to the MP3 file of the same base name)
#key: unique hash identifier of the recording
#sentence: sentence text as read by the speaker, transcribed in AGLC orthography
#attempts: number of recording attempts before acceptance
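A mapping file with these four columns can be read with Python's standard `csv` module. The sketch below makes several assumptions that the card does not confirm: that columns appear in the order listed, that a header row may or may not be present (with or without a leading `#`), and that the MP3 name is obtained by swapping the extension of the stored filename. The `Clip` type and `read_mapping` function are hypothetical names.

```python
import csv
from pathlib import Path
from typing import Iterator, NamedTuple


class Clip(NamedTuple):
    mp3_name: str
    key: str
    sentence: str
    attempts: int


def read_mapping(path: Path) -> Iterator[Clip]:
    """Yield one Clip per data row of a four-column mapping.tsv file."""
    with path.open(encoding="utf-8", newline="") as handle:
        # QUOTE_NONE keeps quotation marks inside sentences as literal text.
        for row in csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE):
            if len(row) != 4 or row[0].lstrip("#") == "audio_filename":
                continue  # skip malformed lines and an optional header row
            audio_filename, key, sentence, attempts = row
            # The stored name is the original WebM key; the MP3 shares its base name.
            mp3_name = Path(audio_filename).with_suffix(".mp3").name
            yield Clip(mp3_name, key, sentence, int(attempts))
```

Reading with an explicit UTF-8 encoding matters here, since the sentence column contains AGLC letters and combining tone marks.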
| audio file | sentence (Basaa, AGLC) |
|---|---|
| 6203dea4f8fa8f6d827d5fad9bb72b6b.mp3 | Ɓàjɛtɛɛ ɓɔn nı̀ gwèe hāna!. |
| 512c84fbc51e9bd52e8068460bd33a1c.mp3 | Lini jı̀bâ li nnɛ̄. |
| ce15014053f849adbe2c1fa06f97f3c2.mp3 | Ti mɛ̀ jı̀bàn matoà. |
| 8f2875ed7bae8d5c5bb36a5b822200b9.mp3 | Jı̌bɛ̀ li ŋkop. |
| 3b01ba1ecc5386c07e570443b61801cf.mp3 | Nyɔ̀ɔ ı̀ a mɛ̀ màn yani nı̀ mɛ̀ mɛ̀ tibil nyɛ. |
| a230f7310aff7012623eb2370fcafb66.mp3 | À ŋkɛ̀. |
| e9489d5b683a08c42c60da52ff4fc535.mp3 | Aa, mɛ̀ m̂ɓômdà! |
| ca088743598cac232ba9af6156f6c6d9.mp3 | À ǹyonos biaâ. |