Basaa-ASR-Dataset

Description

Basaa-ASR-Dataset is a curated speech dataset dedicated to the documentation and technological development of the Basaa language (ISO 639-3: bas), spoken in the Centre, Littoral, and South regions of Cameroon. The dataset was collected as part of a speech data initiative carried out at the École Normale Supérieure de Yaoundé (ENS-Yaoundé) in collaboration with the Mozilla Data Collective (MDC). It is intended to complement the two existing MDC datasets for Basaa: "Common Voice Scripted Speech 25.0 – Basaa" (https://mozilladatacollective.com/datasets/cmn2bk0jn01f2mm07dv6v1jbt) and "Common Voice Spontaneous Speech 3.0 – Basaa" (https://mozilladatacollective.com/datasets/cmn1pow3f00vtmm07dre8danl). The three datasets can be combined or used separately, depending on the task at hand.

The added value of Basaa-ASR-Dataset is twofold. First, the transcription orthography used throughout the dataset is the General Alphabet of Cameroon's Languages (French acronym: AGLC), a standardised, phonologically motivated writing system that was insufficiently represented in previous Basaa speech datasets. Second, the dataset explicitly captures a broader range of Basaa dialectal varieties, including Basaa-ba-Yabassi and Babimbi, which were not sufficiently represented in earlier collections.

The dataset comprises 2,498 MP3 audio recordings distributed across 25 recording sessions. Fifteen sessions (identified as asr-tts_dataset_bas_21 through bas_38) were conducted with 15 distinct speakers, each reading a set of 100 sentences in Basaa. The remaining 10 sessions (bassa_tts_dataset_01 through 10) constitute a dedicated TTS sub-corpus. All recordings were made using the MDC recording platform, which captures speaker metadata including the number of attempts per sentence. A total of 2,500 unique sentences were used across the dataset (1,500 in the ASR/TTS sessions and 1,000 in the TTS sessions), with no overlap between the two sets. The total recorded audio amounts to 6,887 seconds (01:54:46).

Language

Basaa is a Narrow Bantu language (ISO 639-3: bas) spoken across three administrative regions of Cameroon: the Centre, Littoral, and South. It has an estimated 600,000–700,000 speakers, a figure that includes speakers of different varieties as well as diasporic populations who identify as Basaa speakers.

The vitality of the Basaa language is broadly stable (Ethnologue online). However, intergenerational transmission is increasingly threatened among parents aged 50 and under, particularly in urban areas. Although Basaa is taught in some schools, this has not significantly impacted language vitality, mainly due to a reliance on rule-based and descriptivist teaching methods.

Variants

The glossonym 'Basaa' is a generic term encompassing a range of varieties, the speakers of which may identify with the 'Basaa' label to varying degrees, depending on a complex set of geographical, social, political, situational, and pragmatic factors. Some of the most commonly acknowledged varieties include:

Mbene
Bikok
Babimbi
Basaa ba Omeng
Basaa ba Yabassi
Basaa ba Duala
Ndog-Bikim

Other varieties, such as Ndonga, Mbaa (also known as Mbay-Bati), and Hijuk, may also be classified as Basaa, though not all speakers agree on this classification.

The present dataset explicitly includes speakers of Basaa-ba-Yabassi and Babimbi, varieties that were insufficiently represented in the earlier Common Voice Basaa datasets. It therefore extends dialectal coverage across the Basaa-speaking area.

Writing System

The writing system used for all transcriptions in this dataset is the General Alphabet of Cameroon's Languages (French acronym: AGLC), a standardised orthography developed for Cameroonian languages under the auspices of PROPELCA (Operational Research Project for Language Teaching in Cameroon). The AGLC is the orthographic standard closest to the phonological system of Basaa and provides a more consistent and reproducible representation of its consonantal, vocalic, and tonal distinctions than earlier missionary-based orthographies.

1. Vowels

The vowel system of Basaa as represented in the AGLC comprises seven oral vowel qualities:

a, e, ɛ, ı (dotless i), ɔ, o, u

Each vowel may carry one of the tonal diacritics described in section 3. The dotless letter ı (U+0131) is used by the AGLC to distinguish a specific high front unrounded vowel from the standard Latin i.

2. Consonants

The consonant inventory reflected in the dataset includes simple and complex consonants:

Simple consonants: p, b, ɓ, c, d, g, h, j, k, l, m, n, ŋ, s, t, w, y

Complex consonants: mb, nd, ŋg, ŋgw, ny

The bilabial implosive ɓ is phonologically distinctive in Basaa and is systematically represented in the AGLC orthography.
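
For building a grapheme vocabulary (for example, for an ASR tokenizer), the inventory above can be turned into a simple segmenter. The following is a minimal sketch, not part of the dataset tooling: the function name `segment` and the longest-match rule for complex consonants are assumptions, and a sequence such as n + y that happens to straddle a morpheme boundary would be merged into one unit.

```python
import unicodedata

# Complex consonants listed above, longest first so "ŋgw" wins over "ŋg".
COMPLEX = sorted(["mb", "nd", "ŋg", "ŋgw", "ny"], key=len, reverse=True)

def segment(word):
    """Split a word into graphemes (hypothetical helper).

    A unit is either a complex consonant or a single base letter, followed
    by any combining marks (tone diacritics) attached to it.
    """
    # NFD separates tone marks from their base letters.
    text = unicodedata.normalize("NFD", word.lower())
    units, i = [], 0
    while i < len(text):
        match = next((c for c in COMPLEX if text.startswith(c, i)), None)
        j = i + (len(match) if match else 1)
        # Attach following combining marks to the current unit.
        while j < len(text) and unicodedata.combining(text[j]):
            j += 1
        units.append(text[i:j])
        i = j
    return units

print(segment("nyɛ"))   # ['ny', 'ɛ']
print(segment("ŋkop"))  # ['ŋ', 'k', 'o', 'p']
```

Because tone marks stay attached to their base letter, a nasal that carries its own tone mark is kept as a separate unit and is not merged with a following consonant.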

3. Tone system

Basaa has a grammatically and lexically contrastive tonal system. Tones are marked directly on vowels in the AGLC orthography:

High tone (H): á, é, ɛ́, í, ɔ́, ó, ú
Low tone (L): à, è, ɛ̀, ı̀, ɔ̀, ò, ù
Falling tone (HL): â, ê, ɛ̂, ı̂, ɔ̂, ô, û
Rising tone (LH): ǎ, ě, ɛ̌, ı̌, ɔ̌, ǒ, ǔ
Mid / level tone (downstep/upstep): ā, ē, ɛ̄, ī, ɔ̄, ō, ū
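
Because several of these toned vowels have no precomposed Unicode form, the same visible character can be stored as different code point sequences. The sketch below uses Python's standard `unicodedata` module to illustrate this; the helper name `describe` is hypothetical, and the choice of NFC as the comparison form is an assumption, not a statement about how the transcripts are stored.

```python
import unicodedata

def describe(text):
    """Return the code points of `text` after NFC normalization."""
    nfc = unicodedata.normalize("NFC", text)
    return [f"U+{ord(ch):04X}" for ch in nfc]

# Latin a + combining grave composes to a single code point under NFC.
print(describe("a\u0300"))       # ['U+00E0']
# Open e (U+025B) has no precomposed toned form, so the mark stays separate.
print(describe("\u025B\u0300"))  # ['U+025B', 'U+0300']
# Dotless i (U+0131) + grave likewise stays as two code points.
print(describe("\u0131\u0300"))  # ['U+0131', 'U+0300']
# Dotted i + grave composes to U+00EC, which is a different string
# from dotless i + grave even though the two look alike.
print(describe("i\u0300"))       # ['U+00EC']
```

Normalizing both the reference transcripts and any model output to the same form before comparing them avoids spurious mismatches. Dotted and dotless i remain distinct under every Unicode normalization form.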

Source

The dataset was collected through the MDC speech recording platform as part of a data collection initiative conducted at the École Normale Supérieure de Yaoundé (ENS-Yaoundé), Cameroon, under the coordination of Emmanuel Ngue Um. Sentences were drawn from a corpus of general Basaa utterances compiled for ASR and TTS tasks, transcribed in the AGLC orthography by trained annotators familiar with the writing system. Speakers were recruited to represent a range of Basaa varieties, with particular effort to include speakers of Basaa-ba-Yabassi and Babimbi.

Domain

The dataset represents general spoken Basaa drawn from everyday communicative contexts: declarative statements, questions, imperatives, and conversational exchanges. Sentences span a broad range of lexical and grammatical domains, making the dataset suitable as a general-purpose speech resource for Basaa.

Size

Total audio duration: 6,887 seconds (01:54:46), measured from 2,498 MP3 audio clips distributed across 25 recording sessions. Total uncompressed dataset size: approximately 65 MB.
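
From the two figures above, the mean clip length follows directly. A quick arithmetic sketch (the variable names are illustrative only):

```python
total_seconds = 6887  # total audio duration stated above
clips = 2498          # number of MP3 clips stated above

mean_clip_seconds = total_seconds / clips
print(f"{mean_clip_seconds:.2f} s per clip")  # 2.76 s per clip
```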

Structure

The dataset comprises:

- 2,498 MP3 audio clips organised in 25 recording session folders:
  - 15 ASR/TTS session folders (asr-tts_dataset_bas_21 through bas_38), each with 100 clips read by a distinct speaker;
  - 10 TTS session folders (bassa_tts_dataset_01 through 10), each with 100 clips.
- 25 per-session sentence-to-audio mapping files (mapping.tsv), each with 4 columns.

A total of 2,500 unique sentences are covered: 1,500 in the ASR/TTS sessions and 1,000 in the TTS sessions, with no overlap between the two sub-corpora.
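
After downloading, the layout described above can be checked with a short script. This is a minimal sketch under stated assumptions: the session folders sit directly under one root directory, the MP3 files sit directly inside each session folder, and the function names and the root path in the usage note are hypothetical.

```python
from pathlib import Path

def count_clips(root):
    """Map each session folder name under `root` to its number of MP3 files."""
    root = Path(root)
    return {
        folder.name: sum(1 for _ in folder.glob("*.mp3"))
        for folder in sorted(root.iterdir())
        if folder.is_dir()
    }

def summarise(counts):
    """Return (number of sessions, total clips) for a count_clips() result."""
    return len(counts), sum(counts.values())
```

For example, `summarise(count_clips("Basaa-ASR-Dataset"))` would be expected to return 25 sessions and 2,498 clips if the local copy matches the figures given above.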

Description of columns (mapping.tsv)

#audio_filename: filename of the audio clip (original WebM key, corresponding to the MP3 file of the same base name)
#key: unique hash identifier of the recording
#sentence: sentence text as read by the speaker, transcribed in AGLC orthography
#attempts: number of recording attempts before acceptance
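
A mapping file can be read with Python's standard `csv` module. The sketch below is illustrative only and rests on several assumptions: each mapping.tsv starts with a header row whose names match the columns above (with or without the leading `#`), the MP3 files sit in the same folder as mapping.tsv, and the function name `load_mapping` is hypothetical.

```python
import csv
from pathlib import Path

def load_mapping(session_dir):
    """Yield (mp3_path, sentence, attempts) for each row of mapping.tsv."""
    session_dir = Path(session_dir)
    with open(session_dir / "mapping.tsv", encoding="utf-8", newline="") as fh:
        # QUOTE_NONE keeps quotation marks inside sentences untouched.
        reader = csv.DictReader(fh, delimiter="\t", quoting=csv.QUOTE_NONE)
        for row in reader:
            # Drop a leading '#' from header names, if present.
            row = {k.lstrip("#"): v for k, v in row.items() if k}
            # Swap the original extension (e.g. .webm) for .mp3, same base name.
            name = Path(row["audio_filename"]).with_suffix(".mp3").name
            yield session_dir / name, row["sentence"], int(row["attempts"])
```

The number of attempts can serve as a rough filter, for instance keeping only clips accepted on the first attempt with `[r for r in load_mapping(folder) if r[2] == 1]`.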

Sample

audio file	sentence (Basaa, AGLC)
6203dea4f8fa8f6d827d5fad9bb72b6b.mp3	Ɓàjɛtɛɛ ɓɔn nı̀ gwèe hāna!.
512c84fbc51e9bd52e8068460bd33a1c.mp3	Lini jı̀bâ li nnɛ̄.
ce15014053f849adbe2c1fa06f97f3c2.mp3	Ti mɛ̀ jı̀bàn matoà.
8f2875ed7bae8d5c5bb36a5b822200b9.mp3	Jı̌bɛ̀ li ŋkop.
3b01ba1ecc5386c07e570443b61801cf.mp3	Nyɔ̀ɔ ı̀ a mɛ̀ màn yani nı̀ mɛ̀ mɛ̀ tibil nyɛ.
a230f7310aff7012623eb2370fcafb66.mp3	À ŋkɛ̀.
e9489d5b683a08c42c60da52ff4fc535.mp3	Aa, mɛ̀ m̂ɓômdà!
ca088743598cac232ba9af6156f6c6d9.mp3	À ǹyonos biaâ.