License:
NOODL-1.0
Steward:
Institute of African Digital Humanities
Dataset ID:
cmq9hy15t02jpl207wlcqfpxh
Task: ASR
Release Date: 6/11/2026
Format: MP3, TSV
Size: 82.74 MB
Tupuri-ASR-Dataset is a scripted speech dataset dedicated to the documentation and technological development of Tupuri (ISO 639-3: tui), a Chadic language spoken primarily in the Adamawa and North Regions of Cameroon and in southern Chad. The dataset was compiled in 2026 at the École Normale Supérieure de Yaoundé, in the Department of Cameroonian Languages and Cultures.
The dataset comprises 1,800 high-quality MP3 audio recordings of Tupuri sentences read by 16 native speakers across 18 recording sessions, together with per-session sentence-to-audio mapping files that allow text and audio to be aligned precisely. Sentences were drawn from a scripted speech prompt list and read by each speaker in a controlled environment.
The dataset is structured into two subsets, each representing a distinct dialect of Tupuri: the Bango dialect (700 clips, 40m 42s, 7 sessions) and the Banwere dialect (1,100 clips, 1h 43m 08s, 11 sessions). This dialectal partitioning adds value over the existing Common Voice Scripted Speech 25.0 – Tupuri dataset available on the Mozilla Data Collective platform, which does not distinguish dialect varieties. By providing clearly labelled dialect subsets, the present dataset enables the development and evaluation of speech technology models that are sensitive to inter-dialectal phonological and prosodic variation in Tupuri.
A further distinguishing feature of this dataset is its speaker demographics: the majority of recording sessions were contributed by female speakers, a population that is significantly underrepresented in the existing Common Voice Scripted Speech 25.0 – Tupuri resource. This female-speaker majority improves the coverage and representativeness of Tupuri speech data for acoustic modelling purposes.
The transcription of all sentences follows the General Alphabet of Cameroon's Languages (AGLC, from the French Alphabet Général des Langues Camerounaises), the reference standard for Cameroonian national languages. From a methodological perspective, the dataset is designed to complement the existing Common Voice Scripted Speech resource for Tupuri rather than to replace it, thereby extending the total amount of available Tupuri speech data. The parallel availability of AGLC-transcribed text and aligned speech makes the dataset suitable for a wide range of applications, including automatic speech recognition (ASR), text-to-speech (TTS), forced alignment, pronunciation modelling, dialect classification and language learning tools. It also directly supports efforts to standardise and normalise the digital representation of Tupuri in language technology contexts.
Licensing
Nwulite Obodo Open Data Licence 1.0 (NOODL-1.0)
https://licensingafricandatasets.com/nwulite-obodo-license
Restrictions/Special Constraints
By downloading this dataset, you agree:
- To use it for research and scientific purposes only
- That you will not re-host or re-share this dataset
Forbidden Usage
You agree not to use the data for: determining the identity of any speaker in the dataset; attempting to clone any voice or train models that imitate any speaker in this dataset; Generative AI; reproduction; duplication; modification; augmentation; copying; distribution; transmission; display; sale; transfer; publication or creation of derivative works without the explicit permission of the legal owner of the dataset.
Intended Use
(a) Speech-related tasks:
- Automatic speech recognition (ASR): Audio–text alignment enables the evaluation of speech recognition models for Tupuri. The presence of two distinct dialect subsets (Bango and Banwere) makes this dataset particularly suited for building and evaluating dialect-aware ASR models, and for studying the acoustic impact of dialectal variation in Tupuri speech.
- Text-to-speech (TTS): The dataset contains clean sentence–audio pairs from multiple speakers across two dialects and can be used to evaluate or fine-tune speech synthesis models for Tupuri. The female-speaker majority broadens the acoustic coverage for TTS experiments.
- Dialect classification: The two clearly labelled dialect subsets (Bango and Banwere) provide ground truth for training and evaluating automatic dialect identification systems for Tupuri.
- Speech–text alignment / forced alignment benchmarking: Fine-grained audio–text pairing provides ground truth for evaluating phoneme- or word-level aligners adapted to Chadic languages.
(b) Linguistic and lexicographic tasks:
- Phonological and dialectal analysis: The two dialect subsets enable systematic comparison of phonological, prosodic and lexical features across Bango and Banwere varieties of Tupuri.
- Orthographic standardisation and normalisation: The dataset can serve as a reference corpus for evaluating and training text normalisation and grapheme-to-phoneme (G2P) models aligned with the AGLC standard for Tupuri.
- Language documentation: The dataset contributes to the digital documentation of Tupuri scripted speech in AGLC orthography, extending the existing Common Voice resource with dialect-differentiated, female-speaker-majority recordings.
Tupuri is a Chadic language belonging to the Afro-Asiatic phylum, classified within the Biu-Mandara branch. It is indigenous to communities located primarily in the Adamawa and North Regions of Cameroon (particularly the Faro-et-Déo and Mayo-Rey Divisions) and in the southern part of Chad (Mayo-Kebbi region). Tupuri is closely related to other Chadic languages of the same area. Ethnologue estimates the number of speakers at approximately 300,000. Despite its significant speaker population, Tupuri remains substantially underrepresented in language technology resources.
Tupuri presents a number of geographically distributed dialectal varieties. The present dataset covers two principal dialects:
Bango dialect: spoken in the Bango area of the Adamawa Region of Cameroon.
Banwere dialect: spoken in the Banwere area of the Adamawa Region of Cameroon.
These two varieties exhibit phonological and prosodic differences that are reflected in the recorded speech and transcriptions. Each dialect is represented as a distinct subset of the dataset, enabling dialect-sensitive modelling and analysis.
The writing system used for the transcription of Tupuri in this dataset is the General Alphabet of Cameroon's Languages (AGLC), as adopted by the Ministry of Basic Education of Cameroon and regularly updated by the Direction de la Promotion des Langues Nationales. The AGLC provides a phonologically motivated orthographic standard for Cameroonian national languages and serves as the reference framework for Tupuri literacy materials.
The vowel system attested in the dataset includes oral and nasal vowels. The oral vowels are:
a, e, ɛ, i, o, ɔ, u
Nasalised vowels are represented with a tilde diacritic (e.g. ã, ẽ, ĩ, õ, ũ). Long vowels are represented by vowel doubling (e.g. aa, ee, oo).
The consonant inventory reflected in the dataset includes simple, prenasalised, implosive and digraph consonants:
b, ɓ, c, d, ɗ, f, g, gb, h, j, k, kp, l, m, mb, n, nd, ng, nj, ŋ, p, r, s, t, w, y
Special symbols: ɓ (bilabial implosive), ɗ (alveolar implosive), ŋ (velar nasal), gb and kp (labial-velar consonants)
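For G2P or pronunciation-modelling experiments, the inventory above can be used to split a written word into orthographic units. The following is a minimal sketch, not part of the dataset tooling: it assumes a greedy match in which any two-letter sequence from the digraph list (gb, kp, mb, nd, ng, nj) is always treated as one consonant, which may be wrong where the two letters belong to separate sounds. The function name `segment` is hypothetical.

```python
import unicodedata

# Two-letter consonants taken from the inventory listed above.
DIGRAPHS = ("gb", "kp", "mb", "nd", "ng", "nj")


def segment(word):
    """Split one word into orthographic units (greedy digraph matching).

    Assumption: a digraph sequence is always a single consonant.
    Combining marks (tone, nasalisation) stay attached to their base letter.
    """
    text = unicodedata.normalize("NFC", word.lower())
    units = []
    i = 0
    while i < len(text):
        step = 2 if text[i:i + 2] in DIGRAPHS else 1
        unit = text[i:i + step]
        i += step
        # Keep any combining marks that could not be precomposed (e.g. on ɛ, ɔ).
        while i < len(text) and unicodedata.combining(text[i]):
            unit += text[i]
            i += 1
        units.append(unit)
    return units


print(segment("Ngɛl"))
print(segment("Taŋgu"))
```

Long vowels written by doubling (aa, ee, oo) come out as two units here; merging them is a design choice left to the downstream model.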
Tupuri is a tonal language in which tone is both lexically and grammatically contrastive. The dataset marks tone systematically on vowels, following the AGLC convention:
High tone (H): á, é, ɛ́, í, ó, ɔ́, ú
Low tone (L): à, è, ɛ̀, ì, ò, ɔ̀, ù
Falling tone (HL): â, ê, ɛ̂, î, ô, ɔ̂, û
Rising tone (LH): ǎ, ě, ɛ̌, ǐ, ǒ, ɔ̌, ǔ
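The four tone marks listed above correspond to Unicode combining accents, so they can be read or removed after canonical decomposition. The sketch below is illustrative only: the function names are hypothetical, and it assumes the transcriptions use the standard combining acute, grave, circumflex and caron characters (directly or in precomposed letters).

```python
import unicodedata

# Combining marks mapped to the tone labels used in the list above.
TONE_MARKS = {
    "\u0301": "H",   # combining acute accent
    "\u0300": "L",   # combining grave accent
    "\u0302": "HL",  # combining circumflex accent
    "\u030c": "LH",  # combining caron
}


def tone_sequence(text):
    """Return the tone labels marked in the text, in order of appearance."""
    decomposed = unicodedata.normalize("NFD", text)
    return [TONE_MARKS[ch] for ch in decomposed if ch in TONE_MARKS]


def strip_tones(text):
    """Remove tone marks; base letters and nasalisation tildes are kept."""
    decomposed = unicodedata.normalize("NFD", text)
    kept = "".join(ch for ch in decomposed if ch not in TONE_MARKS)
    return unicodedata.normalize("NFC", kept)


print(tone_sequence("À laa lɛ gɔ líŋ"))
print(strip_tones("wùr raw líŋ"))
```

A tone-stripped copy of the text can be useful when comparing ASR output against references with and without tone marking, since the two error rates measure different things.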
The dataset was compiled from scripted speech prompt lists read by native speakers of Tupuri in recording sessions held at the École Normale Supérieure de Yaoundé in 2026, in the framework of the Mozilla Data Collective project. Sentences were selected to provide broad phonological coverage of Tupuri across both dialects and were transcribed in accordance with the AGLC orthographic standard.
The dataset represents scripted speech in Tupuri (Bango and Banwere dialects), covering a broad range of everyday sentence types drawn from a general-purpose ASR/TTS prompt list. All utterances are scripted rather than spontaneous.
Total audio duration: 8,630 seconds (02h 23m 50s), distributed across 1,800 MP3 audio clips in 18 recording sessions contributed by 16 native speakers of Tupuri.
Bango subset: 2,442 seconds (00h 40m 42s), 700 MP3 clips, 7 sessions.
Banwere subset: 6,188 seconds (01h 43m 08s), 1,100 MP3 clips, 11 sessions.
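The figures above can be cross-checked with a few lines of arithmetic. This sketch uses only the second and clip counts stated in the description. The mean clip length it prints is a derived value, not a figure published with the dataset.

```python
def format_duration(total_seconds):
    """Format a whole number of seconds as 'HHh MMm SSs'."""
    hours, remainder = divmod(total_seconds, 3600)
    minutes, seconds = divmod(remainder, 60)
    return f"{hours:02d}h {minutes:02d}m {seconds:02d}s"


# Values copied from the dataset description.
SUBSET_SECONDS = {"Bango": 2442, "Banwere": 6188}
SUBSET_CLIPS = {"Bango": 700, "Banwere": 1100}

total_seconds = sum(SUBSET_SECONDS.values())
total_clips = sum(SUBSET_CLIPS.values())
print(total_clips, "clips,", format_duration(total_seconds))

for name, seconds in SUBSET_SECONDS.items():
    mean_clip = seconds / SUBSET_CLIPS[name]  # derived, in seconds per clip
    print(name, format_duration(seconds), f"mean clip {mean_clip:.2f} s")
```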
The dataset is organised into two top-level dialect subsets:
Tupuri-Bango subset — 700 MP3 audio clips read by 6 native speakers of the Bango dialect, with a total duration of 2,442 seconds (00h 40m 42s), distributed across 7 recording sessions (one speaker contributed two sessions):
Session tui-bango_02: 100 clips (07m 12s)
Session tui-bango_03: 100 clips (07m 03s)
Session tui-bango_03-1: 100 clips (04m 32s)
Session tui-bango_05: 100 clips (05m 03s)
Session tui-bango_06: 100 clips (05m 25s)
Session tui-bango_07: 100 clips (05m 18s)
Session tui-bango_08: 100 clips (06m 06s)
Tupuri-Banwere subset — 1,100 MP3 audio clips read by 10 native speakers of the Banwere dialect, with a total duration of 6,188 seconds (01h 43m 08s), distributed across 11 recording sessions (one speaker contributed two sessions):
Session tui-banwere_01: 100 clips (10m 11s)
Session tui-banwere_02: 100 clips (09m 18s)
Session tui-banwere_03: 100 clips (14m 43s)
Session tui-banwere_04: 100 clips (14m 43s)
Session tui-banwere_05: 100 clips (06m 22s)
Session tui-banwere_06: 100 clips (05m 41s)
Session tui-banwere_07: 100 clips (07m 24s)
Session tui-banwere_08: 100 clips (04m 51s)
Session tui-banwere_08-1: 100 clips (06m 03s)
Session tui-banwere_09: 100 clips (18m 04s)
Session tui-banwere_10: 100 clips (05m 44s)
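Session names carry the dialect label, so they can be parsed to produce ground-truth labels for dialect classification. The sketch below also groups sessions by their numeric part, on the assumption (not stated in the description) that a "-1" suffix marks the second session of the same speaker. That reading yields 6 Bango and 10 Banwere groups, which matches the stated speaker counts. The function names are hypothetical.

```python
import re
from collections import defaultdict

# Session identifiers as listed in the description.
SESSION_IDS = [
    "tui-bango_02", "tui-bango_03", "tui-bango_03-1", "tui-bango_05",
    "tui-bango_06", "tui-bango_07", "tui-bango_08",
    "tui-banwere_01", "tui-banwere_02", "tui-banwere_03", "tui-banwere_04",
    "tui-banwere_05", "tui-banwere_06", "tui-banwere_07", "tui-banwere_08",
    "tui-banwere_08-1", "tui-banwere_09", "tui-banwere_10",
]

SESSION_PATTERN = re.compile(r"^tui-(bango|banwere)_(\d+)(?:-(\d+))?$")


def parse_session_id(session_id):
    """Split a session ID into dialect, number and optional suffix."""
    match = SESSION_PATTERN.match(session_id)
    if match is None:
        raise ValueError(f"unrecognised session ID: {session_id!r}")
    dialect, number, suffix = match.groups()
    return {"dialect": dialect, "number": number, "suffix": suffix}


def group_by_presumed_speaker(session_ids):
    """Group sessions by (dialect, number); assumed to identify one speaker."""
    groups = defaultdict(list)
    for session_id in session_ids:
        info = parse_session_id(session_id)
        groups[(info["dialect"], info["number"])].append(session_id)
    return dict(groups)


groups = group_by_presumed_speaker(SESSION_IDS)
print(len(groups), "presumed speakers")
```

Keeping all sessions of one group in the same split is a simple way to avoid speaker overlap between training and evaluation data, provided the suffix assumption holds.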
Each session folder contains:
100 MP3 audio clips
One per-session sentence-to-audio mapping file (mapping.tsv), with 4 columns:
#audio_filename: filename of the audio clip (MP3)
#key: unique hash identifier of the recording
#sentence: sentence text as read by the speaker, transcribed in AGLC orthography
#attempts: number of recording attempts before acceptance
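A mapping file with these four columns could be loaded roughly as follows. This is a sketch under assumptions: it supposes that the file is tab-separated, that its first row is a header, and that header cells may or may not carry the leading "#" shown above. The `key` and `attempts` values in the sample row are placeholders, not real dataset values.

```python
import csv
import io

EXPECTED_COLUMNS = ["audio_filename", "key", "sentence", "attempts"]


def read_mapping(handle):
    """Read a mapping.tsv-style file into a list of dictionaries."""
    rows = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    # Tolerate an optional leading '#' on header cells.
    header = [cell.lstrip("#").strip() for cell in next(rows)]
    missing = [name for name in EXPECTED_COLUMNS if name not in header]
    if missing:
        raise ValueError(f"missing columns: {missing}")
    records = []
    for row in rows:
        if not row:  # skip blank lines
            continue
        record = dict(zip(header, row))
        record["attempts"] = int(record["attempts"])
        records.append(record)
    return records


# In-memory sample; the key and attempts values are placeholders.
sample = io.StringIO(
    "#audio_filename\t#key\t#sentence\t#attempts\n"
    "15ee0eb81f6b1b027c3efec2a814fc86.mp3\tplaceholder-key\tjobo de wee coore\t1\n"
)
records = read_mapping(sample)
print(records[0]["audio_filename"], "->", records[0]["sentence"])
```

With a real session folder, the same function would be called on a file opened with `open(path, encoding="utf-8", newline="")`, where `path` points to that session's mapping.tsv.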
Bango dialect:
| audio file | sentence (Tupuri/Bango, AGLC) |
|---|---|
| b5351795ce4014109b1341538475b3c9.mp3 | Á bay wɔ wɔge ti lakɔl gɔ blam wa hase |
| 6d646be9a9eb647ab1da6fa13d9242ee.mp3 | À laa lɛ gɔ líŋ ti lakɔl de 15h30. |
| 36bb4683202861f067a834b96564db9e.mp3 | Taŋgu maaga wùr raw líŋ, wùr yɔg ɓil tiŋ lakɔl gɔ ɗa. |
| 15ee0eb81f6b1b027c3efec2a814fc86.mp3 | jobo de wee coore |
Banwere dialect:
| audio file | sentence (Tupuri/Banwere, AGLC) |
|---|---|
| c9064c51ac1629957611685b5b713efc.mp3 | Wɔ siigi , fẽẽre tii wɔ . |
| b21c9d46a89551d38040783c26c5abb6.mp3 | Huuli , baa , lɛɛge ɗaw woo kẽẽ . |
| b190a612efce64a662678fcc50c1048e.mp3 | Ngl ciŋ kaŋ gɔ koɓ koɓe . |
| 626336964ac491d59706b95da5a233ed.mp3 | Ngɛl de suŋ kɔl do bay ne no |