Tupuri-ASR-Dataset

Description

Tupuri-ASR-Dataset is a scripted speech dataset for the documentation and technological development of Tupuri (ISO 639-3: tui), an Adamawa (Niger-Congo) language spoken primarily in the Far North Region of Cameroon and in southern Chad. It was compiled within the framework of the Mozilla Data Collective initiative (2026). It comprises 2,085 high-quality MP3 audio recordings of Tupuri sentences read by native speakers across 21 recording sessions, together with per-session sentence-to-audio mapping files that align each text prompt with its recording. Sentences were drawn from a scripted speech prompt list and read by each speaker in a controlled environment.

The dataset is divided into two subsets, each representing a distinct dialect of Tupuri: the Bango dialect (985 clips, 1h 02m 21s, 10 sessions, 9 speakers) and the Banwere dialect (1,100 clips, 1h 43m 40s, 11 sessions, 10 speakers). This dialectal partitioning is the main addition relative to the existing Common Voice Scripted Speech 25.0 – Tupuri dataset available on the Mozilla Data Collective platform, which does not distinguish dialect varieties. With clearly labelled dialect subsets, the present dataset supports the development and evaluation of speech technology models that are sensitive to inter-dialectal phonological and prosodic variation in Tupuri.

A further distinguishing feature is the speaker demographics: the majority of recording sessions were contributed by female speakers, a group that is significantly underrepresented in the existing Common Voice Scripted Speech 25.0 – Tupuri resource. This emphasis on female speakers broadens the speaker coverage available for acoustic modelling of Tupuri.

All sentences are transcribed in the General Alphabet of Cameroon's Languages (AGLC, from the French Alphabet Général des Langues Camerounaises), the reference standard for Cameroonian national languages. The dataset is designed to complement the existing Common Voice Scripted Speech resource for Tupuri rather than to replace it, thereby extending the total amount of available Tupuri speech data. The parallel availability of AGLC-transcribed text and aligned speech makes the dataset suitable for a wide range of applications, including automatic speech recognition (ASR), text-to-speech (TTS), forced alignment, pronunciation modelling, dialect classification and language learning tools. It also directly supports efforts to standardise and normalise the digital representation of Tupuri in language technology contexts.
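
The subset figures quoted in the description can be cross-checked against the stated totals with a few lines of Python. This is only a sketch: the numbers are copied from the text above, the helper names (`to_seconds`, `format_duration`) are illustrative, and the combined duration is derived by arithmetic rather than stated by the dataset authors.

```python
import re

# Subset figures as quoted in the description (clips, duration, sessions).
SUBSETS = {
    "Bango": {"clips": 985, "duration": "1h 02m 21s", "sessions": 10},
    "Banwere": {"clips": 1100, "duration": "1h 43m 40s", "sessions": 11},
}


def to_seconds(duration: str) -> int:
    """Convert a string such as '1h 02m 21s' to a number of seconds."""
    match = re.fullmatch(r"(\d+)h (\d+)m (\d+)s", duration)
    if match is None:
        raise ValueError(f"unexpected duration format: {duration!r}")
    hours, minutes, seconds = map(int, match.groups())
    return hours * 3600 + minutes * 60 + seconds


def format_duration(total_seconds: int) -> str:
    """Format a number of seconds back into the 'Xh YYm ZZs' style."""
    hours, rest = divmod(total_seconds, 3600)
    minutes, seconds = divmod(rest, 60)
    return f"{hours}h {minutes:02d}m {seconds:02d}s"


total_clips = sum(s["clips"] for s in SUBSETS.values())
total_sessions = sum(s["sessions"] for s in SUBSETS.values())
total_seconds = sum(to_seconds(s["duration"]) for s in SUBSETS.values())

print(total_clips, total_sessions, format_duration(total_seconds))
for name, s in SUBSETS.items():
    # Mean clip length per dialect subset, in seconds.
    print(name, round(to_seconds(s["duration"]) / s["clips"], 2))
```

The clip and session sums match the 2,085 recordings and 21 sessions given in the description; the combined duration works out to 2h 46m 01s.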

audio file	sentence (Tupuri/Bango, AGLC)
b5351795ce4014109b1341538475b3c9.mp3	Á bay wɔ wɔge ti lakɔl gɔ blam wa hase
6d646be9a9eb647ab1da6fa13d9242ee.mp3	À laa lɛ gɔ líŋ ti lakɔl de 15h30.
36bb4683202861f067a834b96564db9e.mp3	Taŋgu maaga wùr raw líŋ, wùr yɔg ɓil tiŋ lakɔl gɔ ɗa.
15ee0eb81f6b1b027c3efec2a814fc86.mp3	jobo de wee coore

audio file	sentence (Tupuri/Banwere, AGLC)
c9064c51ac1629957611685b5b713efc.mp3	Wɔ siigi , fẽẽre tii wɔ .
b21c9d46a89551d38040783c26c5abb6.mp3	Huuli , baa , lɛɛge ɗaw woo kẽẽ .
b190a612efce64a662678fcc50c1048e.mp3	Ngl ciŋ kaŋ gɔ koɓ koɓe .
626336964ac491d59706b95da5a233ed.mp3	Ngɛl de suŋ kɔl do bay ne no
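
The sample rows above pair an audio file name with its AGLC sentence. A minimal sketch of reading such rows into a lookup table follows, assuming the per-session mapping files are tab-separated, start with a header row, and carry the file name and sentence in their first two columns. The actual column layout is not specified in this section, and `read_mapping` and `SAMPLE` are illustrative names, not part of the dataset.

```python
import csv
import io

# Rows copied from the Banwere sample table; the two-column layout
# (file name, sentence) is an assumption about the real mapping files.
SAMPLE = (
    "audio file\tsentence\n"
    "c9064c51ac1629957611685b5b713efc.mp3\tWɔ siigi , fẽẽre tii wɔ .\n"
    "b21c9d46a89551d38040783c26c5abb6.mp3\tHuuli , baa , lɛɛge ɗaw woo kẽẽ .\n"
    "b190a612efce64a662678fcc50c1048e.mp3\tNgl ciŋ kaŋ gɔ koɓ koɓe .\n"
    "626336964ac491d59706b95da5a233ed.mp3\tNgɛl de suŋ kɔl do bay ne no\n"
)


def read_mapping(handle) -> dict[str, str]:
    """Return {audio file name: sentence} from a tab-separated text stream."""
    # QUOTE_NONE keeps any quotation marks inside sentences untouched.
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    next(reader, None)  # skip the header row
    mapping = {}
    for row in reader:
        if len(row) < 2:
            continue  # ignore blank or malformed lines
        mapping[row[0].strip()] = row[1].strip()
    return mapping


mapping = read_mapping(io.StringIO(SAMPLE))
for audio_file, sentence in mapping.items():
    print(audio_file, "->", sentence)
```

For a file on disk, the same function can be called on a handle opened with `open(path, encoding="utf-8", newline="")`, so that the AGLC characters (ɔ, ɛ, ŋ, ɓ, ɗ and tone-marked vowels) are read correctly.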
