License: NOODL-1.0
Steward: Institute of African Digital Humanities
Dataset ID: cmqf58twx06fdl207ihp2oywl
Task: TTS
Release Date: 6/15/2026
Format: MP3, TSV
Size: 38.64 MB
Medumba-TTS-Dataset is a scripted speech dataset dedicated to the documentation and technological development of Medumba (ISO 639-3: byv), a Grassfields Bantu language spoken in the Ndé Division of the West Region of Cameroon. The dataset was compiled within the framework of the Mozilla Data Collective initiative (2026) and complements the existing Common Voice Scripted Speech 25.0 – Medumba dataset (https://mozilladatacollective.com/datasets/cmn2chivm01cbo107vqvbgn2i). It comprises 994 high-quality MP3 audio recordings of Medumba sentences read by a native speaker across 10 recording sessions, together with per-session sentence-to-audio mapping files that enable precise alignment between textual and acoustic data. Sentences were drawn from a scripted speech prompt list and read in a controlled environment.
The transcription of all sentences follows the General Alphabet of Cameroon's Languages (AGLC; French acronym: Alphabet Général des Langues Camerounaises), the reference standard for Cameroonian national languages. The Medumba AGLC orthography is distinguished by: an extended vowel inventory (including the low back unrounded vowel ɑ, the open-mid front unrounded vowel ɛ, the open-mid back rounded vowel ɔ, the high central rounded vowel ʉ, and the mid central schwa ə); a set of labialized consonants written by appending w to the base consonant (e.g., kw, gw, sw, bw); a series of pre-nasalized consonants written as digraphs or trigraphs (e.g., mb, nd, ŋg, ns, nsw); a two-level tone-marking system using grave (low) and contour diacritics (caron for rising LH; circumflex for falling HL) applied to vowels, with high tone unmarked; and the modifier letter apostrophe (ʼ) for the glottal stop.
The parallel availability of AGLC-transcribed text and aligned speech makes the dataset suitable for a wide range of applications, including text-to-speech (TTS) synthesis, automatic speech recognition (ASR), forced alignment, pronunciation modelling, and language learning tools. It also directly supports efforts to standardise and normalise the digital representation of Medumba in language technology contexts.
Licensing
Nwulite Obodo Open Data Licence 1.0 (NOODL-1.0)
https://licensingafricandatasets.com/nwulite-obodo-license
Restrictions/Special Constraints
By downloading this dataset, you agree:
- To use it for research and scientific use only
- That you will not re-host or re-share this dataset
Forbidden Usage
You agree not to use the data for: determining the identity of any speaker in the dataset; attempting to clone any voice or train models that imitate any speaker in this dataset; Generative AI; reproduction; duplication; modification; augmentation; copying; distribution; transmission; display; sale; transfer; publication or creation of derivative works without the explicit permission of the legal owner of the dataset.
Intended Use
(a) Speech-related tasks:
- Text-to-speech (TTS) synthesis: The dataset provides clean sentence–audio pairs from multiple recording sessions and is directly suited for training, fine-tuning, and evaluating speech synthesis models for Medumba. The availability of AGLC-transcribed sentences with aligned audio enables the development of TTS systems capable of producing natural-sounding Medumba speech.
- Automatic speech recognition (ASR): Audio–text alignment enables the training and evaluation of speech recognition models for Medumba. The per-session structure and controlled recording conditions make the dataset suitable for building and evaluating ASR models for this under-resourced language.
- Speech–text alignment / forced alignment benchmarking: Fine-grained audio–text pairing provides ground truth for evaluating phoneme- or word-level aligners adapted to Grassfields Bantu languages of the Ndé area.
- Pronunciation modelling: The AGLC-transcribed sentences, combined with aligned audio, provide a resource for developing grapheme-to-phoneme (G2P) models and pronunciation lexicons for Medumba.
(b) Linguistic and lexicographic tasks:
- Phonological analysis: The dataset enables systematic study of the phonological and tonal system of Medumba, including its two-level tone contrast on verbs, four-way noun tone class distinction (Voorhoeve 1971), extended vowel inventory (ɑ, ɛ, ɔ, ʉ, ə), labialized consonant series (kw, gw, sw, bw, etc.), and rich inventory of pre-nasalized consonants (mb, nd, ŋg, ns, nsw, etc.).
- Orthographic standardisation and normalisation: The dataset can serve as a reference corpus for evaluating and training text normalisation models aligned with the AGLC standard for Medumba, and for documenting and resolving the inter-session orthographic variation observed in this dataset (see Writing System § 4 below).
- Language documentation: The dataset contributes to the digital documentation of Medumba scripted speech in CEPOM orthography, supporting efforts to extend the digital presence of this Grassfields Bantu language of the West Region of Cameroon. It is part of a broader documentation effort that addresses rising levels of language endangerment documented for Cameroon (Kandybowicz & Torrence 2017).
Medumba (also written Mə̀dʉ̂mbɑ̀) is a Bamileke language belonging to the Niger-Congo phylum, classified within the Mbam-Nkam branch of the Eastern Grassfields subgroup. It is spoken in the Ndé Division of the West Region of Cameroon, with its main settlements in Bangangté, Bakong, Bangoulap, Bahouoc, Bagnoun and Tonga. Medumba is the most widely researched language of the Bamileke cluster, which also comprises Fe'fe', Ghomálá', Kwa', and Nda'nda'. Despite its sociolinguistic significance within Cameroon — estimated at approximately 210,000 speakers (Ethnologue, 1991) — Medumba remains substantially underrepresented in language technology resources.
Medumba has one identified dialectal variant, the Batongtou dialect, alongside the Bangangté dialect, which constitutes the primary reference variety in linguistic documentation and on which the bulk of the scholarly literature is based (Voorhoeve 1965, 1967, 1971, 1977).
The writing system used for the transcription of Medumba in this dataset is the General Alphabet of Cameroon's Languages (AGLC; French acronym: Alphabet Général des Langues Camerounaises), the reference standard for Cameroonian national languages. The Medumba AGLC orthography was developed and is maintained by CEPOM (Comité de Langue pour l'Etude et la Production des Œuvres Bamiléké-Medumba), based in Bangangté, and was formally adopted at the 4th CEPOM Council on 2 February 1985.
Medumba has 12 simple phonemic vowels (Voorhoeve 1965, 1977). The vowel system attested in the dataset includes the following oral vowels:
a, ɑ, e, ə, ɛ, i, o, ɔ, u, ʉ
Where:
a: low central unrounded vowel
ɑ (alpha): low back unrounded vowel
e: high-mid front unrounded vowel (+ATR)
ə (schwa): mid central unrounded vowel
ɛ (epsilon): open-mid front unrounded vowel
i: high front unrounded vowel (+ATR)
o: high-mid back rounded vowel (+ATR)
ɔ (open-o): open-mid back rounded vowel
u: high back rounded vowel (+ATR)
ʉ (barred u): high central rounded vowel
In addition, Medumba has five phonemic diphthongs: ia, ʉa, iə, ʉɑ, and uɑ.
The consonant inventory reflected in the dataset includes simple, labialized, and pre-nasalized consonants in accordance with the CEPOM standard:
Simple consonants: b, c (=tʃ), d, f, g, gh (=ɣ), h, j (=ɟ), k, l, m, n, ŋ, ny (=ɲ), s, sh (=ʃ), t, ts (=ʦ), v, w, y, z, ʼ (=ʔ)
Labialized consonants are formed by appending w to the base consonant or cluster:
bw, cw, fw, gw, jw, kw, nw, ŋw, sw
Pre-nasalized consonants are written as digraphs (NC) or trigraphs (NCW):
mb, mbw, nd, nj, njw, ŋg, ŋgw, nt, nc, ncw, ŋk, ŋkw, mf, mfw, ns, nsw
Special symbols:
ŋ (eng): velar nasal consonant
ʼ (modifier letter apostrophe, U+02BC): glottal stop
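As a rough illustration of how the digraphs and trigraphs above might be handled in a G2P or alignment front end, the sketch below splits a word into graphemes by greedy longest match. The function name `segment` and the test strings are hypothetical and are not taken from the dataset; only the grapheme list comes from the inventory above.

```python
import unicodedata

# Multi-letter consonant graphemes copied from the inventory above.
MULTIGRAPHS = [
    "gh", "ny", "sh", "ts",
    "bw", "cw", "fw", "gw", "jw", "kw", "nw", "ŋw", "sw",
    "mb", "mbw", "nd", "nj", "njw", "ŋg", "ŋgw", "nt", "nc", "ncw",
    "ŋk", "ŋkw", "mf", "mfw", "ns", "nsw",
]
# Try longer graphemes first so that "mbw" wins over "mb".
_BY_LENGTH = sorted(MULTIGRAPHS, key=len, reverse=True)


def segment(word: str) -> list[str]:
    """Split a word into graphemes using greedy longest match (hypothetical helper)."""
    text = unicodedata.normalize("NFC", word.lower())
    out = []
    i = 0
    while i < len(text):
        for grapheme in _BY_LENGTH:
            if text.startswith(grapheme, i):
                out.append(grapheme)
                i += len(grapheme)
                break
        else:
            ch = text[i]
            # Keep combining marks (tone diacritics) with the preceding letter.
            if unicodedata.combining(ch) and out:
                out[-1] += ch
            else:
                out.append(ch)
            i += 1
    return out


print(segment("mbw\u0251"))  # illustrative string, not a dataset word
```

Greedy matching is only a first approximation: a sequence such as n followed by sh would be split as ns + h, so ambiguous sequences may need a lexicon or manual rules.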
Medumba is a tonal language with a two-way surface tone contrast (Low and High) on verbs and a four-way noun tone class distinction (Voorhoeve 1971). The dataset employs tone marking on vowels in accordance with the CEPOM convention. The following diacritics are attested in the dataset:
High tone (H): unmarked
Low tone (L): grave accent — à, ɑ̀, è, ə̀, ɛ̀, ì, ò, ɔ̀, ù, ʉ̀
Falling contour (HL): circumflex — â, ɑ̂, ê, ə̂, ɛ̂, î, ô, ɔ̂, û, ʉ̂
Rising contour (LH): caron — ǎ, ə̌, ɛ̌, ǐ, ǒ, ɔ̌, ǔ, ʉ̌
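The diacritic convention above can be read mechanically by decomposing each vowel and inspecting its combining mark. The following is a minimal sketch under that assumption; `tone_of` is a hypothetical helper name, and any mark other than the three listed is reported as "other" rather than interpreted.

```python
import unicodedata

# Combining marks for the tone diacritics listed above.
TONE_MARKS = {
    "\u0300": "L",   # combining grave accent
    "\u0302": "HL",  # combining circumflex accent
    "\u030C": "LH",  # combining caron
}


def tone_of(vowel: str) -> str:
    """Return the tone label of one vowel grapheme; unmarked counts as H."""
    decomposed = unicodedata.normalize("NFD", vowel)
    marks = [c for c in decomposed if unicodedata.combining(c)]
    if not marks:
        return "H"
    if len(marks) > 1:
        return "other"
    return TONE_MARKS.get(marks[0], "other")


for v in ["a\u0300", "\u0289\u0302", "\u0254\u030C", "e"]:
    print(v, tone_of(v))
```

The sample rows at the end of this card also contain acute and macron accents, which this sketch labels "other" instead of guessing a tone value.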
A careful review of the per-session mapping files reveals inter-session orthographic variation arising from differences in input method and transcriber practice. The following deviations from the CEPOM standard were observed:
Sessions 01 and 05: Use of IPA characters in place of CEPOM graphemes (ʧ U+02A7 for CEPOM c; ʤ U+02A4 for CEPOM ndj; ʒ U+0292 for CEPOM j or nj; ʃ U+0283 for CEPOM sh; ɣ U+0263 for CEPOM gh; ɲ U+0272 for CEPOM ny; ɱ U+0271 for CEPOM m); these sessions also use the IPA glottal stop character (ʔ U+0294) in place of the CEPOM modifier letter apostrophe (ʼ U+02BC).
Session 02: Use of the Greek small letter alpha (α U+03B1) in place of the CEPOM Latin small letter alpha (ɑ U+0251).
Sessions 03 and 05: Use of the Latin small letter turned e (ǝ U+01DD) or Cyrillic small letter schwa (ә U+04D9) in place of the CEPOM Latin small letter schwa (ə U+0259).
Sessions 06–10: Use of the Latin small letter n with dot above (ṅ U+1E45) in place of the CEPOM eng (ŋ U+014B); use of the right single quotation mark (’ U+2019) alongside or instead of the CEPOM modifier letter apostrophe (ʼ U+02BC) for the glottal stop.
These orthographic variants do not affect the phonological coverage of the dataset but should be taken into account in text normalisation pipelines.
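A text normalisation step for the deviations listed above could look roughly like the sketch below. It covers only the replacements that the list gives as a single target; ʤ and ʒ are deliberately left untouched, since the list gives alternative targets for ʒ and the mapping for ʤ should be confirmed against the data. The function name `normalise` is hypothetical.

```python
import unicodedata

# One-to-one replacements taken from the deviations listed above.
SINGLE_CHAR = str.maketrans({
    "\u02A7": "c",       # IPA tesh digraph -> c
    "\u0283": "sh",      # IPA esh -> sh
    "\u0263": "gh",      # IPA gamma -> gh
    "\u0272": "ny",      # n with left hook -> ny
    "\u0271": "m",       # m with hook -> m
    "\u0294": "\u02BC",  # IPA glottal stop -> modifier letter apostrophe
    "\u2019": "\u02BC",  # right single quotation mark -> modifier letter apostrophe
    "\u03B1": "\u0251",  # Greek alpha -> Latin alpha
    "\u01DD": "\u0259",  # turned e -> schwa
    "\u04D9": "\u0259",  # Cyrillic schwa -> schwa
})


def normalise(text: str) -> str:
    """Map known variant code points to their CEPOM equivalents."""
    # Decompose first so that precomposed letters such as a Greek alpha
    # carrying an accent expose their base character.
    decomposed = unicodedata.normalize("NFD", text)
    # n with dot above decomposes to n + U+0307; map the pair to eng.
    decomposed = decomposed.replace("n\u0307", "\u014B")
    return unicodedata.normalize("NFC", decomposed.translate(SINGLE_CHAR))


print(normalise("\u1E45g\u01DD\u0301"))
```

Decomposing before mapping and recomposing afterwards keeps tone diacritics attached to the corrected base letter.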
The dataset was compiled from scripted speech prompt lists read by a native speaker of Medumba (Bangangté dialect). Sentences were selected to provide broad phonological coverage of Medumba and were transcribed in accordance with the AGLC orthographic standard.
The dataset represents scripted speech in Medumba, covering a broad range of everyday sentence types drawn from a general-purpose TTS/ASR prompt list. All utterances are scripted rather than spontaneous.
Total audio duration: 4,276 seconds (01h 11m 16s), distributed across 994 MP3 audio clips in 10 recording sessions.
Audio duration computed using audio-duration.py (mutagen 1.47.0; ffprobe-verified).
The dataset is organised into 10 recording sessions:
Session tts_dataset_byv_01: 99 clips (08m 03s)
Session tts_dataset_byv_02: 100 clips (07m 52s)
Session tts_dataset_byv_03: 100 clips (07m 18s)
Session tts_dataset_byv_04: 100 clips (07m 35s)
Session tts_dataset_byv_05: 100 clips (07m 50s)
Session tts_dataset_byv_06: 100 clips (05m 43s)
Session tts_dataset_byv_07: 100 clips (06m 31s)
Session tts_dataset_byv_08: 100 clips (06m 40s)
Session tts_dataset_byv_09: 100 clips (06m 29s)
Session tts_dataset_byv_10: 95 clips (07m 12s)
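The per-session figures can be cross-checked against the reported totals with a few lines of arithmetic. The sketch below simply re-adds the numbers listed above; the helper name `to_seconds` is hypothetical.

```python
# Clip counts and durations copied from the session list above.
SESSIONS = {
    "tts_dataset_byv_01": (99, "08m 03s"),
    "tts_dataset_byv_02": (100, "07m 52s"),
    "tts_dataset_byv_03": (100, "07m 18s"),
    "tts_dataset_byv_04": (100, "07m 35s"),
    "tts_dataset_byv_05": (100, "07m 50s"),
    "tts_dataset_byv_06": (100, "05m 43s"),
    "tts_dataset_byv_07": (100, "06m 31s"),
    "tts_dataset_byv_08": (100, "06m 40s"),
    "tts_dataset_byv_09": (100, "06m 29s"),
    "tts_dataset_byv_10": (95, "07m 12s"),
}


def to_seconds(stamp: str) -> int:
    """Convert a string such as '08m 03s' to whole seconds."""
    minutes, seconds = stamp.split()
    return int(minutes.rstrip("m")) * 60 + int(seconds.rstrip("s"))


total_clips = sum(count for count, _ in SESSIONS.values())
total_seconds = sum(to_seconds(d) for _, d in SESSIONS.values())
print(total_clips)    # 994
print(total_seconds)  # 4273
```

The clip count matches the reported 994. The summed duration is 3 seconds short of the reported 4,276 seconds, which would be consistent with each session duration having been rounded down to whole seconds.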
Each session folder contains:
MP3 audio clips
One per-session sentence-to-audio mapping file (mapping.tsv), with 4 columns
#audio_filename: filename of the audio clip (MP3)
#key: unique hash identifier of the recording
#sentence: sentence text as read by the speaker, transcribed in CEPOM orthography
#attempts: number of recording attempts before acceptance
| audio file | sentence (Medumba, AGLC) |
|---|---|
| bf3b017ea63a09123b36c0acb58aab43.mp3 | ŋgàmbándá àʔ- nɛ̀ɛ́n ʧwɛ̀t ndàáʔndʒʉ á gə̀- lɛ̀ɛ́n mbə́zə̄ mbàŋ gə́- lú |
| 421886434692cfc9d071901cc0d81bac.mp3 | lɛ̂n sə bə α̂ Fʉ̀'nkə'ə à bə a zə nunga |
| a240324809aa0f990a72e2f42dc6c649.mp3 | mə ghʉ̌ tà' nshun mɛnmαndùm mbὰ tà' nshun mɛn mɛ̀nnzwi |
| 566d2baa7091db5d067acf13e99931a7.mp3 | àbâ tɔ̌ tûɁndá zə̄ Númí lù - ngǝ́ bɛ́tǝ́ càŋ ŋkɔ̀k tə̀tswə́ |
| bf03399bfe1b63c531dca78b33147275.mp3 | Nya bùnte ndù |
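A mapping file with the four columns described above could be loaded along these lines. This is a sketch under assumptions: the file is taken to be tab-separated with a header row, a leading "#" on column names is stripped if present, and the key and attempts values in the in-memory sample are placeholders rather than real dataset values.

```python
import csv
import io


def read_mapping(handle):
    """Read mapping rows from a tab-separated file object (hypothetical helper)."""
    # QUOTE_NONE keeps quotation marks in sentences as literal characters.
    reader = csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    rows = []
    for raw in reader:
        # Tolerate header names written with a leading '#'.
        row = {name.lstrip("#").strip(): value for name, value in raw.items()}
        row["attempts"] = int(row["attempts"])
        rows.append(row)
    return rows


# Placeholder key and attempts values; filename and sentence are from the table above.
SAMPLE = (
    "audio_filename\tkey\tsentence\tattempts\n"
    "bf03399bfe1b63c531dca78b33147275.mp3\texample-key\tNya bùnte ndù\t1\n"
)
rows = read_mapping(io.StringIO(SAMPLE))
print(rows[0]["audio_filename"], rows[0]["sentence"])
```

For a real session folder, the same function can be called on `open(path, encoding="utf-8", newline="")`, where `path` points to that session's mapping.tsv.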