License: NOODL-1.0
Steward: Institute of African Digital Humanities
Dataset ID: cmqf6bpns06gal20726pxld9z
Task: TTS
Release Date: 6/15/2026
Format: MP3, TSV
Size: 46.93 MB
Tupuri-bango_TTS-Dataset is a scripted speech dataset dedicated to the documentation and technological development of Tupuri (ISO 639-3: tui), a Chadic language of the Afro-Asiatic phylum spoken in the Kaele and Mayo-Danay Divisions of the Far-North Region of Cameroon and in adjacent areas of southern Chad. The dataset was compiled in the framework of the Mozilla Data Collective initiative (2026) as a supplement to the Common Voice Scripted Speech 25.0 – Tupuri dataset (https://mozilladatacollective.com/datasets/cmn2ca0qt01fomm07ivn5e89r) and to the Tupuri-ASR-Dataset (https://mozilladatacollective.com/datasets/cmq9hy15t02jpl207wlcqfpxh). The speech material represents the Bango dialectal variety of Tupuri.
The dataset comprises 983 high-quality MP3 audio recordings of Tupuri-Bango sentences read by a native speaker across 10 recording sessions, together with per-session sentence-to-audio mapping files that align each sentence with its audio clip. Sentences were drawn from a scripted speech prompt list and read in a controlled environment.
The transcription of all sentences follows the General Alphabet of Cameroon's Languages (AGLC, from the French Alphabet Général des Langues Camerounaises), the reference standard for Cameroonian national languages. The Tupuri-Bango orthography employed in this dataset is characterised by: an extended vowel inventory that includes the open-mid front unrounded vowel ɛ and the open-mid back rounded vowel ɔ; a set of nasalized vowels written with the tilde diacritic (ã, ẽ, ũ, õ); long vowels encoded by vowel doubling (aa, ɛɛ, ɔɔ, ãã, etc.); a series of implosive consonants written with hooked letters (ɓ for the bilabial implosive and ɗ for the alveolar implosive); the velar nasal consonant ŋ (eng); a multi-register tone-marking system combining level (acute, grave) and contour (caron) diacritics applied to vowels; and the apostrophe (', ') as a glottal stop or glottalization marker.
The parallel availability of AGLC-transcribed text and aligned speech makes the dataset suitable for a wide range of applications, including text-to-speech (TTS) synthesis, automatic speech recognition (ASR), forced alignment, pronunciation modelling, and language learning tools. It also directly supports efforts to standardise and normalise the digital representation of Tupuri in language technology contexts.
Licensing
Nwulite Obodo Open Data Licence 1.0 (NOODL-1.0)
https://licensingafricandatasets.com/nwulite-obodo-license
Restrictions/Special Constraints
By downloading this dataset, you agree:
- To use it for research and scientific use only
- That you will not re-host or re-share this dataset
Forbidden Usage
You agree not to use the data for: determining the identity of any speaker in the dataset; attempting to clone any voice or train models that imitate any speaker in this dataset; Generative AI; reproduction; duplication; modification; augmentation; copying; distribution; transmission; display; sale; transfer; publication or creation of derivative works without the explicit permission of the legal owner of the dataset.
Intended Use
(a) Speech-related tasks:
- Text-to-speech (TTS) synthesis: The dataset provides clean sentence–audio pairs from multiple recording sessions and is directly suited for training, fine-tuning, and evaluating speech synthesis models for Tupuri-Bango. The availability of AGLC-transcribed sentences with aligned audio enables the development of TTS systems capable of producing natural-sounding Tupuri speech in the Bango variety.
- Automatic speech recognition (ASR): Audio–text alignment enables the training and evaluation of speech recognition models for Tupuri. The per-session structure and controlled recording conditions make the dataset suitable for building and evaluating ASR models for this under-resourced language, and the dataset directly complements the Tupuri-ASR-Dataset already available on the Mozilla Data Collective platform.
- Speech–text alignment / forced alignment benchmarking: Fine-grained audio–text pairing provides ground truth for evaluating phoneme- or word-level aligners adapted to Chadic languages of the Far-North Region of Cameroon.
- Pronunciation modelling: The AGLC-transcribed sentences, combined with aligned audio, provide a resource for developing grapheme-to-phoneme (G2P) models and pronunciation lexicons for Tupuri-Bango.
(b) Linguistic and lexicographic tasks:
- Phonological analysis: The dataset enables systematic study of the phonological and tonal system of Tupuri-Bango, including its tonal contrasts, nasalized vowel inventory (ã, ẽ, ũ, õ), extended vowel inventory (ɛ, ɔ), implosive consonant series (ɓ, ɗ), and the behaviour of long vowels.
- Dialectological research: Because it explicitly represents the Bango variety of Tupuri, the dataset provides a point of comparison for future documentation of other Tupuri varieties, including Tupuri-Banwere.
- Orthographic standardisation and normalisation: The dataset can serve as a reference corpus for evaluating and training text normalisation models aligned with the AGLC standard for Tupuri.
- Language documentation: The dataset contributes to the digital documentation of Tupuri-Bango scripted speech in AGLC orthography, supporting efforts to extend the digital presence of this Chadic language of the Far-North Region of Cameroon.
Tupuri (also referred to as Toupouri or Tupure) is a Chadic language belonging to the Afro-Asiatic phylum. It is spoken primarily in the Kaele and Mayo-Danay Divisions of the Far-North Region of Cameroon and in adjacent areas of southern Chad. Despite its regional significance, Tupuri remains substantially underrepresented in language technology resources.
The dialectal situation of Tupuri is likely not fully documented. Two of its most salient varieties are:
Tupuri-Bango (represented by the present dataset)
Tupuri-Banwere
The writing system used for the transcription of Tupuri-Bango in this dataset is the General Alphabet of Cameroon's Languages (AGLC). The AGLC provides a phonologically motivated orthographic standard for Cameroonian national languages and serves as the reference framework for Tupuri literacy materials.
The vowel system attested in the dataset includes the following oral vowels:
a, e, i, o, u, ɛ, ɔ
Where:
ɛ (epsilon): open-mid front unrounded vowel
ɔ (open-o): open-mid back rounded vowel
Nasalized vowels are represented by the tilde diacritic placed directly on the vowel:
ã, ẽ, ũ, õ
Long vowels are represented by vowel doubling (e.g., aa, oo, ɛɛ, ɔɔ, ãã).
The consonant inventory reflected in the dataset includes simple and digraph consonants:
b, c, d, f, g, h, j, k, l, m, n, p, r, s, t, w, y
Implosive consonants (AGLC hooked letters):
ɓ: bilabial implosive (b with hook)
ɗ: alveolar implosive (d with hook)
Special symbols:
ŋ (eng): velar nasal consonant
' or ' (apostrophe): glottal stop or glottalization marker
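The inventory above can be turned into a simple orthography check. The following is a minimal sketch, not part of the dataset tooling: it assumes the letters listed here are exhaustive and that the only combining marks in use are the tilde and the tone diacritics (acute, grave, caron) this card describes. The function name `unexpected_characters` is hypothetical. Punctuation, including the apostrophe, and whitespace are ignored.

```python
import unicodedata

# Letters taken from the inventory listed in this card (lowercase forms).
VOWELS = set("aeiouɛɔ")
CONSONANTS = set("bcdfghjklmnprstwy") | {"ɓ", "ɗ", "ŋ"}
LETTERS = VOWELS | CONSONANTS

# Combining marks: tilde (nasalization), acute, grave, caron (tones).
MARKS = {"\u0303", "\u0301", "\u0300", "\u030C"}


def unexpected_characters(sentence):
    """Return the sorted letters and marks that fall outside the inventory.

    The sentence is lowercased and decomposed (NFD) so that a precomposed
    character such as "ã" is checked as "a" plus a combining tilde.
    """
    found = set()
    for ch in unicodedata.normalize("NFD", sentence.lower()):
        category = unicodedata.category(ch)
        if category.startswith("L") and ch not in LETTERS:
            found.add(ch)
        elif category.startswith("M") and ch not in MARKS:
            found.add(ch)
    return sorted(found)


# Sentences from the sample table in this card pass the check.
print(unexpected_characters("Cɔɓwɛ hee taarag ɓaale ɓɛ gà"))
print(unexpected_characters("Ɗee ndu ɓɔ ɓaale gãy maaga ndɔ yiŋ jag la ?"))
```

A non-empty result does not necessarily mean a transcription error; it may simply indicate a character that this card does not list.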
Tupuri-Bango is a tonal language. The dataset employs systematic tone marking on vowels in accordance with the AGLC convention. The following diacritics are attested in the dataset:
Level tones:
High tone (H): acute accent — á, é, í, ú
Low tone (L): grave accent — à, è, ì, ù
Contour tones:
Rising tone (LH): caron — ě, ǐ, ǔ
Vowels without a tone diacritic are contextually interpreted in accordance with the AGLC convention for Tupuri.
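As an illustration of the tone-marking convention, the sketch below extracts tone-marked vowels from a sentence with Python's standard `unicodedata` module. It assumes that tone diacritics are recoverable by canonical decomposition (NFD), which holds for the precomposed characters shown above; the function name `tone_bearing_vowels` and the labels H, L, LH are illustrative choices mirroring the list above, not part of the dataset.

```python
import unicodedata

# Combining marks for the tones listed above.
TONE_MARKS = {
    "\u0301": "H",   # acute accent: high tone
    "\u0300": "L",   # grave accent: low tone
    "\u030C": "LH",  # caron: rising tone
}


def tone_bearing_vowels(sentence):
    """Return (base letter, tone label) pairs in order of appearance.

    The tilde (nasalization) is a combining mark too, but it is not a
    tone mark, so vowels carrying only a tilde are skipped.
    """
    pairs = []
    base = None
    for ch in unicodedata.normalize("NFD", sentence):
        if unicodedata.combining(ch) == 0:
            base = ch  # remember the most recent base character
        elif ch in TONE_MARKS and base is not None:
            pairs.append((base, TONE_MARKS[ch]))
    return pairs


# Sample sentence taken from the table in this card.
print(tone_bearing_vowels("Wãy ma de deele cɔŋ wǔr gɔ síigi gà"))
# [('u', 'LH'), ('i', 'H'), ('a', 'L')]
```

Working on the decomposed form means the same code handles text stored in either NFC or NFD, which is worth checking before training on the transcriptions.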
The dataset was compiled from scripted speech prompt lists read by a native speaker. Sentences were selected to provide broad phonological coverage of Tupuri-Bango and were transcribed in accordance with the AGLC orthographic standard.
The dataset represents scripted speech in Tupuri-Bango, covering a broad range of everyday sentence types drawn from a general-purpose TTS/ASR prompt list. All utterances are scripted rather than spontaneous.
Total audio duration: 3,949 seconds (01h 05m 49s), distributed across 983 MP3 audio clips in 10 recording sessions.
The dataset is organised into 10 recording sessions:
Session tts_dataset_tui-bango_01: 100 clips (6m 22s)
Session tts_dataset_tui-bango_02: 98 clips (5m 37s)
Session tts_dataset_tui-bango_03: 100 clips (6m 02s)
Session tts_dataset_tui-bango_04: 100 clips (6m 19s)
Session tts_dataset_tui-bango_05: 100 clips (6m 34s)
Session tts_dataset_tui-bango_06: 100 clips (7m 41s)
Session tts_dataset_tui-bango_07: 100 clips (6m 45s)
Session tts_dataset_tui-bango_08: 100 clips (7m 00s)
Session tts_dataset_tui-bango_09: 100 clips (6m 54s)
Session tts_dataset_tui-bango_10: 85 clips (6m 30s)
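The per-session figures can be cross-checked against the stated totals with a few lines of Python. This is only a sanity check on the numbers printed in this card; the dictionary below is transcribed by hand from the list above.

```python
# Clip counts and durations copied from the session list above.
SESSIONS = {
    "tts_dataset_tui-bango_01": (100, "6m 22s"),
    "tts_dataset_tui-bango_02": (98, "5m 37s"),
    "tts_dataset_tui-bango_03": (100, "6m 02s"),
    "tts_dataset_tui-bango_04": (100, "6m 19s"),
    "tts_dataset_tui-bango_05": (100, "6m 34s"),
    "tts_dataset_tui-bango_06": (100, "7m 41s"),
    "tts_dataset_tui-bango_07": (100, "6m 45s"),
    "tts_dataset_tui-bango_08": (100, "7m 00s"),
    "tts_dataset_tui-bango_09": (100, "6m 54s"),
    "tts_dataset_tui-bango_10": (85, "6m 30s"),
}


def to_seconds(text):
    """Convert a duration such as '6m 22s' to whole seconds."""
    minutes, seconds = text.split()
    return int(minutes.rstrip("m")) * 60 + int(seconds.rstrip("s"))


total_clips = sum(clips for clips, _ in SESSIONS.values())
total_seconds = sum(to_seconds(d) for _, d in SESSIONS.values())

print(total_clips)    # 983
print(total_seconds)  # 3944
```

The clip counts add up to the stated 983. The per-session durations add up to 3,944 seconds, 5 seconds short of the stated total of 3,949 seconds; this would be consistent with per-session durations being rounded or truncated to whole seconds, although the card does not say so. Using the stated total, the average clip is roughly 4 seconds long (3,949 / 983).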
Each session folder contains:
MP3 audio clips
One per-session sentence-to-audio mapping file (mapping.tsv), with 4 columns:
audio_filename: filename of the audio clip (MP3)
key: unique hash identifier of the recording
sentence: sentence text as read by the speaker, transcribed in AGLC orthography
attempts: number of recording attempts before acceptance
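A mapping file with these four columns could be read as sketched below, using only the Python standard library. Several details are assumptions rather than documented facts: that the file is UTF-8 encoded, that it begins with a header row using exactly these column names, and that `attempts` holds an integer. The function names and the `key` and `attempts` values in the demo row are hypothetical placeholders.

```python
import csv
import io
from pathlib import Path

EXPECTED_COLUMNS = ["audio_filename", "key", "sentence", "attempts"]


def read_mapping(stream):
    """Parse a tab-separated mapping stream into a list of row dicts."""
    # QUOTE_NONE keeps apostrophes and quotation marks in sentences literal.
    reader = csv.DictReader(stream, delimiter="\t", quoting=csv.QUOTE_NONE)
    missing = [c for c in EXPECTED_COLUMNS if c not in (reader.fieldnames or [])]
    if missing:
        raise ValueError(f"missing columns: {missing}")
    rows = []
    for row in reader:
        row["attempts"] = int(row["attempts"])  # assumed to be an integer
        rows.append(row)
    return rows


def load_session(session_dir):
    """Read mapping.tsv from one session folder (assumed UTF-8)."""
    path = Path(session_dir) / "mapping.tsv"
    with path.open(encoding="utf-8", newline="") as handle:
        return read_mapping(handle)


# In-memory demo; the key and attempts values are made up for illustration.
sample = (
    "audio_filename\tkey\tsentence\tattempts\n"
    "3fb9f8af5fb47271ed800dc6e2919b89.mp3\tplaceholder-key\tMaysay de sa' hããre\t1\n"
)
rows = read_mapping(io.StringIO(sample))
print(rows[0]["audio_filename"], rows[0]["sentence"])
```

With the real data, `load_session("tts_dataset_tui-bango_01")` would return one dict per clip, which can then be paired with the MP3 files in the same folder.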
| audio file | sentence (Tupuri-Bango, AGLC) |
|---|---|
| 3fb9f8af5fb47271ed800dc6e2919b89.mp3 | Maysay de sa' hããre |
| ab3cf291f9ca9de98bcda070f4a78fd9.mp3 | Cɔɓwɛ hee taarag ɓaale ɓɛ gà |
| f6d1d31bdf3d26043cf2f1076cddeb0f.mp3 | Ɗee ndu ɓɔ ɓaale gãy maaga ndɔ yiŋ jag la ? |
| 4d94402f53cd5209c693c22049a4d76e.mp3 | Wãy ma de deele cɔŋ wǔr gɔ síigi gà |
| edcfc83ed491a82da3fe40b552fb2ca7.mp3 | Ndo re hɔɔle we ? |