Ewondo-ASR-Dataset

Description

Ewondo-ASR-Dataset is a scripted speech dataset dedicated to the documentation and technological development of Ewondo (ISO 639-3: ewo), a Narrow Bantu language spoken primarily in the Centre, South and East Regions of Cameroon, where it also functions as a vehicular language. The dataset was compiled at the École Normale Supérieure de Yaoundé with contributions from students. It comprises 1,781 high-quality MP3 audio recordings of Ewondo sentences read by 16 native speakers across 19 recording sessions, together with per-session sentence-to-audio mapping files that link each sentence to its audio clip. Sentences were drawn from a scripted speech prompt list and read by each speaker in a controlled environment.

The primary added value of this dataset lies in its orthographic alignment with the General Alphabet of Cameroon's Languages (AGLC, from the French Alphabet Général des Langues Camerounaises), the reference standard for Cameroonian national languages. In particular, this dataset preserves systematic tone marking, a feature that the existing Common Voice Scripted Speech 25.0 – Ewondo dataset available on the Mozilla Data Collective platform tends to omit. By making tone information explicit in the transcription, this dataset enables the development and evaluation of speech technology models that are sensitive to the phonemic tonal contrasts of Ewondo. From a methodological perspective, the dataset is designed to complement the existing Common Voice Scripted Speech resource for Ewondo rather than to replace it, thereby extending the total amount of available Ewondo speech data aligned with an orthographically principled transcription standard.

The parallel availability of AGLC-transcribed text and aligned speech makes the dataset suitable for a wide range of applications, including automatic speech recognition (ASR), text-to-speech (TTS), forced alignment, pronunciation modelling and language learning tools. It also directly supports efforts to standardise and normalise the digital representation of Ewondo in language technology contexts.

Language

Ewondo is a Narrow Bantu language belonging to the Beti-Fang group of the Benue-Congo branch. It is spoken natively mainly in the Centre Region of Cameroon, with significant speech communities in the South and East Regions. Ewondo also functions as a vehicular language in those regions and has given rise to a creolised variety known as Mongo Ewondo. Ethnologue estimates the number of speakers at approximately 900,000, including first- and second-language users. Despite its relatively large speaker base, Ewondo remains significantly underrepresented in language technology resources.

Variants

The glossonym 'Ewondo' designates a set of closely related linguistic varieties whose speakers may or may not identify with the label, depending on geographical, social and pragmatic factors. In the framework of the Atlas Linguistique du Cameroun (ALCAM) project, Ewondo is listed as one of the major micro-languages of the Beti-Fang macro-language, alongside Fang, Bulu, Ntumu and Eton. Varieties such as Yezoum (Haut-Nyong Division), Yanda and Moog-Ebanda are considered sub-varieties of Ewondo in standard classifications, though this classification is not always accepted by their speakers.

The present dataset represents speakers of the Ewondo variety as spoken in the Yaoundé area (Centre Region), recruited at the École Normale Supérieure de Yaoundé.

Writing System

The writing system used for the transcription of Ewondo in this dataset is the General Alphabet of Cameroon's Languages (AGLC), as adopted by the Ministry of Basic Education of Cameroon and regularly updated by the Direction de la Promotion des Langues Nationales. The AGLC provides a phonologically motivated orthographic standard for Cameroonian national languages and serves as the reference framework for Ewondo literacy materials, including those produced by the Catholic and Protestant missionary traditions that have subsequently aligned with this standard.

1. Vowels

The vowel system attested in the dataset includes the following oral vowels:

a, e, ə, i, o, u, ɔ

Long vowels are represented by vowel doubling (e.g. aa, ee, oo).

2. Consonants

The consonant inventory reflected in the dataset includes simple, prenasalized and digraph consonants:

b, d, dz, f, g, h, k, l, m, mb, mv, n, nd, ng, nk, nz, ny, ŋ, p, s, t, ts, v, w, y, z

Special symbols used in the orthography: ə (mid central vowel), ɔ (open-mid back rounded vowel), ŋ (velar nasal)
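
Because several consonants are written as digraphs, text processing (for example building a character vocabulary for ASR) may need to split words into orthographic units rather than single characters. The following is a minimal sketch, assuming a greedy longest-match strategy over the vowel and consonant lists given above; the function name `segment` is hypothetical, and a greedy match cannot tell a true digraph such as ng from a sequence n + g across a syllable boundary.

```python
import unicodedata

# Grapheme lists copied from the inventory above.
VOWELS = ["a", "e", "ə", "i", "o", "u", "ɔ"]
CONSONANTS = [
    "b", "d", "dz", "f", "g", "h", "k", "l", "m", "mb", "mv", "n", "nd",
    "ng", "nk", "nz", "ny", "ŋ", "p", "s", "t", "ts", "v", "w", "y", "z",
]
# Longest graphemes first so that digraphs win over single letters.
GRAPHEMES = sorted(VOWELS + CONSONANTS, key=len, reverse=True)


def segment(word: str) -> list[str]:
    """Split a word into graphemes (hypothetical helper, greedy longest match).

    Tone diacritics are kept and attached to the preceding unit, in
    decomposed (NFD) form. Doubled vowels come out as two vowel units.
    """
    text = unicodedata.normalize("NFD", word.lower())
    units: list[str] = []
    i = 0
    while i < len(text):
        # Attach combining marks (tone diacritics) to the previous unit.
        if unicodedata.combining(text[i]) and units:
            units[-1] += text[i]
            i += 1
            continue
        for grapheme in GRAPHEMES:
            if text.startswith(grapheme, i):
                units.append(grapheme)
                i += len(grapheme)
                break
        else:
            # Character outside the inventory (punctuation, digits, ...).
            units.append(text[i])
            i += 1
    return units


print(segment("ndɔ"))  # ['nd', 'ɔ']
```

Any real tokeniser for this dataset should be checked against the actual transcriptions, since the inventory above describes what is attested rather than a full syllabification rule.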

3. Tone system

Ewondo is a tonal language in which tone is both lexically and grammatically contrastive. The dataset employs systematic tone marking on vowels in accordance with the AGLC convention:

High tone (H): á, é, ə́, í, ó, ɔ́, ú
Low tone (L): à, è, ə̀, ì, ò, ɔ̀, ù
Falling tone (HL): â, ê, ə̂, î, ô, ɔ̂, û
Rising tone (LH): ǎ, ě, ə̌, ǐ, ǒ, ɔ̌, ǔ

Unmarked vowels represent tonally neutral or contextually determined syllables. This explicit tone notation distinguishes the present dataset from the Common Voice Scripted Speech 25.0 – Ewondo resource, in which tone diacritics are systematically absent.
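
When combining this dataset with a resource that lacks tone marks, it can be useful to derive a toneless version of each sentence, or to extract the tone sequence on its own. The sketch below assumes that the four tone categories listed above are encoded with the standard Unicode combining acute, grave, circumflex and caron marks once the text is decomposed (NFD); the function names are hypothetical.

```python
import unicodedata

# Combining marks assumed to encode the four tone categories listed above.
TONE_MARKS = {
    "\u0301": "H",   # combining acute accent
    "\u0300": "L",   # combining grave accent
    "\u0302": "HL",  # combining circumflex accent
    "\u030C": "LH",  # combining caron
}


def strip_tones(text: str) -> str:
    """Remove tone diacritics while keeping base letters such as ə, ɔ and ŋ."""
    decomposed = unicodedata.normalize("NFD", text)
    kept = "".join(ch for ch in decomposed if ch not in TONE_MARKS)
    return unicodedata.normalize("NFC", kept)


def tone_sequence(text: str) -> list[str]:
    """Return the tone labels of all marked letters, in order of appearance."""
    decomposed = unicodedata.normalize("NFD", text)
    return [TONE_MARKS[ch] for ch in decomposed if ch in TONE_MARKS]


print(strip_tones("nâ bitá"))      # na bita
print(tone_sequence("nâ bitá"))    # ['HL', 'H']
```

Normalising to NFD first means the same code handles both precomposed letters (á) and base-plus-combining sequences (ə́). Note that any tone mark carried by a nasal consonant is removed and reported in the same way as one on a vowel.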

Source

The dataset was compiled from scripted speech prompt lists read by native speakers of Ewondo in recording sessions held at the École Normale Supérieure de Yaoundé in 2026, in the framework of the Mozilla Data Collective project. Sentences were selected to provide broad phonological coverage of Ewondo and were transcribed in accordance with the AGLC orthographic standard, with full tone marking.

Domain

The dataset represents scripted speech in Ewondo, covering a broad range of everyday sentence types drawn from a general-purpose ASR/TTS prompt list. All utterances are scripted rather than spontaneous.

Size

Total audio duration: 11,457 seconds (03:10:57), distributed across 1,781 MP3 audio clips in 19 recording sessions contributed by 16 native speakers of Ewondo. Total uncompressed dataset size: approximately [X] MB.

Structure

The dataset comprises:

1,781 MP3 audio clips read by 16 native speakers of Ewondo, with a total duration of 11,457 seconds (03:10:57), distributed across 19 recording sessions:
- Session ewo_01: 97 clips (12m 03s)
- Session ewo_02: 99 clips (14m 41s)
- Session ewo_03: 99 clips (07m 52s)
- Session ewo_04: 10 clips (01m 03s)
- Session ewo_07: 99 clips (08m 51s)
- Session ewo_08: 98 clips (08m 03s)
- Session ewo_09: 96 clips (07m 09s)
- Session ewo_13: 96 clips (10m 23s)
- Session ewo_13-1: 97 clips (17m 29s)
- Session ewo_14: 99 clips (12m 23s)
- Session ewo_14-1: 99 clips (19m 09s)
- Session ewo_15: 99 clips (08m 46s)
- Session ewo_18: 100 clips (12m 04s)
- Session ewo_18-1: 100 clips (10m 52s)
- Session ewo_19: 96 clips (10m 21s)
- Session ewo_22: 99 clips (05m 49s)
- Session ewo_30: 99 clips (06m 55s)
- Session ewo_31: 99 clips (08m 46s)
- Session ewo_32: 100 clips (08m 08s)
Nineteen per-session sentence-to-audio mapping files (mapping.tsv), each with 4 columns.
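
The per-session figures above can be cross-checked against the stated totals with a few lines of arithmetic. The sketch below hard-codes the session list as given in this document; the helper names are hypothetical. The clip counts sum to exactly 1,781. The per-session durations sum to 11,447 seconds, ten seconds below the stated total of 11,457 seconds, presumably because each session duration is given only to the nearest whole second.

```python
# Session name -> (number of clips, duration), copied from the list above.
SESSIONS = {
    "ewo_01": (97, "12m 03s"), "ewo_02": (99, "14m 41s"),
    "ewo_03": (99, "07m 52s"), "ewo_04": (10, "01m 03s"),
    "ewo_07": (99, "08m 51s"), "ewo_08": (98, "08m 03s"),
    "ewo_09": (96, "07m 09s"), "ewo_13": (96, "10m 23s"),
    "ewo_13-1": (97, "17m 29s"), "ewo_14": (99, "12m 23s"),
    "ewo_14-1": (99, "19m 09s"), "ewo_15": (99, "08m 46s"),
    "ewo_18": (100, "12m 04s"), "ewo_18-1": (100, "10m 52s"),
    "ewo_19": (96, "10m 21s"), "ewo_22": (99, "05m 49s"),
    "ewo_30": (99, "06m 55s"), "ewo_31": (99, "08m 46s"),
    "ewo_32": (100, "08m 08s"),
}


def to_seconds(duration: str) -> int:
    """Convert a string such as '12m 03s' to a number of seconds."""
    minutes, seconds = duration.split()
    return int(minutes.rstrip("m")) * 60 + int(seconds.rstrip("s"))


def to_hms(total_seconds: int) -> str:
    """Format a number of seconds as HH:MM:SS."""
    hours, remainder = divmod(total_seconds, 3600)
    minutes, seconds = divmod(remainder, 60)
    return f"{hours:02d}:{minutes:02d}:{seconds:02d}"


total_clips = sum(clips for clips, _ in SESSIONS.values())
total_seconds = sum(to_seconds(d) for _, d in SESSIONS.values())

print(len(SESSIONS), total_clips)  # 19 1781
print(to_hms(11457))               # 03:10:57
```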

Description of columns (mapping.tsv)

#audio_filename: filename of the audio clip (MP3)
#key: unique hash identifier of the recording
#sentence: sentence text as read by the speaker, transcribed in AGLC orthography with tone marking
#attempts: number of recording attempts before acceptance
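
A mapping file with these four columns can be read with Python's standard csv module. This is a minimal sketch under stated assumptions: that each mapping.tsv is UTF-8, tab-separated, and starts with a header row naming the columns (optionally with a leading #). The `Clip` class and `read_mapping` function are hypothetical names, and the key and attempts values in the demonstration row are invented placeholders, not taken from the dataset.

```python
import csv
import io
from dataclasses import dataclass


@dataclass
class Clip:
    audio_filename: str
    key: str
    sentence: str
    attempts: int


def read_mapping(handle) -> list[Clip]:
    """Parse a tab-separated mapping file from an open text handle.

    Assumes the first row is a header; a leading '#' on column names
    is stripped so '#sentence' and 'sentence' are treated alike.
    """
    rows = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    header = [name.lstrip("#").strip() for name in next(rows)]
    clips = []
    for row in rows:
        if not row:
            continue  # skip blank lines
        record = dict(zip(header, row))
        clips.append(
            Clip(
                audio_filename=record["audio_filename"],
                key=record["key"],
                sentence=record["sentence"],
                attempts=int(record["attempts"]),
            )
        )
    return clips


# Demonstration on an in-memory file. The filename and sentence come from
# the sample in this document; the key and attempts values are invented.
demo = io.StringIO(
    "audio_filename\tkey\tsentence\tattempts\n"
    "def9b64235e4e0d803d23de18a665b6a.mp3\tplaceholder-key\t"
    "Mə̌men makad nə ma yəm, ma kad nə mayəm.\t1\n"
)
clips = read_mapping(demo)
print(clips[0].audio_filename, clips[0].attempts)
```

For a file on disk, the same function can be given a handle opened with `encoding="utf-8"` and `newline=""`. Quoting is disabled so that quotation marks inside sentences are read literally.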

Sample

audio file	sentence (Ewondo, AGLC)
1dbc5504f402c312236c645b271511f2.mp3	Aa bɔŋ !!! Dɔŋ ósúsúa nâ bitá biá bɔ, ndɔ wa yə̌m fə, wa kad na wa yəm bǎn minlaŋ itə mivɔg, hǹń ?
def9b64235e4e0d803d23de18a665b6a.mp3	Mə̌men makad nə ma yəm, ma kad nə mayəm.
66efb6963a001b47d7989d273726871d.mp3	Abim ma sili wa
96e1bf490f6869899cd23134432393f2.mp3	Iyɔŋ wa síli ma ábím ma yəm mə kadə́ wa.