License:
NOODL-1.0
Steward:
Institute of African Digital Humanities
Dataset ID:
cmpwidn0500zpo0079wsj1phv
Task: NLP
Release Date: 6/2/2026
Format: MP3, TSV
Size: 9.23 MB
Gbaya-Lay_ALCAM-MultimodalDataset is a richly curated, multimodal linguistic dataset dedicated to the documentation and technological enhancement of Gbaya-Lay (Làì), a variety of Northwest Gbaya (ISO 639-3: gya), a Niger-Congo language spoken in Cameroon and the Central African Republic. Gbaya-Lay is a localised and socially embedded speech form that is rarely represented in standard grammatical descriptions or lexicographical resources. The dataset comprises three closely aligned components: (i) a structured datasheet containing carefully selected example sentences reflecting usage in Gbaya-Lay; (ii) high-quality audio recordings of these sentences, produced by a native speaker across three recording sessions; and (iii) explicit audio–sentence mapping files enabling precise alignment between the textual and acoustic data.
The dataset's primary added value lies in its explicit focus on the Gbaya-Lay variety of Northwest Gbaya. Gbaya-Lay (Làì) is classified as a dialect of Northwest Gbaya (gya) by the Ethnologue (https://www.ethnologue.com/language/gya/) and is referenced in the standard atlases of Cameroon's languages: the Atlas Linguistique du Cameroun by Breton and Bikia Fohtung (1991) and the Atlas Linguistique de l'Afrique Centrale: le Cameroun by Bibam Bikoi (2012). Gbaya-Lay is geographically restricted to a small area north of Mbodomo, in Cameroon, which distinguishes it from the more widely spoken Gbaya-Kara (Kàrà) variety. Like many geographically and socially situated language varieties, Gbaya-Lay typically remains invisible in reference grammars, dictionaries and educational materials, which often privilege better-documented or more standardised forms of the language. The dataset captures micro-variation in phonetics, phonology, morphosyntax and lexical choice, which is essential for understanding socially situated linguistic practices rather than a homogeneous, abstract system. In this sense, the dataset contributes to a more inclusive representation of linguistic diversity within the Northwest Gbaya speech community.
From a methodological perspective, the dataset is designed to bridge the gap between language documentation and language technology. The parallel availability of text in Gbaya-Lay and in French, alongside aligned speech, makes the dataset suitable for a wide range of applications, including automatic speech recognition (ASR), text-to-speech (TTS), machine translation (MT), forced alignment, pronunciation modelling and multimodal language learning tools. At the same time, the structured datasheet supports linguistic analysis, contrastive studies with other language varieties and pedagogical uses in teacher training and language revitalisation contexts. More broadly, the Gbaya-Lay_ALCAM-MultimodalDataset exemplifies an approach to African language resources that highlights fluidity, longitudinal variation, orality and community-based practice.
Licensing
Nwulite Obodo Open Data Licence 1.0 (NOODL-1.0)
https://licensingafricandatasets.com/nwulite-obodo-license
Restrictions/Special Constraints
By downloading this dataset, you agree:
- to use it for research and scientific use only
- that you will not re-host or re-share this dataset
Forbidden Usage
You agree not to use the data for: determining the identity of the speaker in the dataset; attempting to clone the voice of the speaker or training models that imitate the speaker in this dataset; Generative AI; reproduction; duplication; modification; augmentation; copying; distribution; transmission; display; sale; transfer; publication or creation of derivative works without the explicit permission of the legal owner of the dataset.
Intended Use
(a) Speech-related tasks:
- Automatic speech recognition (ASR): Audio–text alignment allows the evaluation of speech recognition models for Gbaya-Lay. However, it should be noted that the sentences are transcribed phonetically using the IPA. An orthographic standard based on the General Alphabet of Cameroon's Languages exists for Northwest Gbaya but is not used in this dataset.
- Text-to-speech (TTS): As the dataset contains clean sentence–audio pairs, it can also be used to evaluate speech synthesis or text-to-speech models. Here again, it should be noted that the alphabet used to write the sentences is the IPA and not the General Alphabet of Cameroon's Languages or any other conventional orthography for Northwest Gbaya.
- Speech–text alignment/forced alignment benchmarking: Fine-grained, word-level segmentation provides ideal ground truth for evaluating phoneme- or word-level aligners.
(b) Translation and multilingual tasks:
- Machine translation (Gbaya-Lay ↔ French): The sentence-level alignment between Gbaya-Lay and French makes it a parallel corpus for evaluating translation models, with the caveat of the phonetic orthographic standard employed.
- Speech translation (speech-to-text)
(c) Linguistic and lexicographic tasks:
- Morphological analysis/glossed corpus studies: The morpheme-level glosses and grammatical data are valuable for computational morphology, interlinear text modelling (ILTs) and grammar induction tasks for Gbaya-Lay and related Northwest Gbaya varieties.
- Lexicon and part-of-speech tagging: These are useful for building linguistic resources such as dictionaries, morphological analysers or taggers for Gbaya-Lay.
Gbaya-Lay (Làì) is classified as a dialect of Northwest Gbaya (ISO 639-3: gya) by the Ethnologue (https://www.ethnologue.com/language/gya/) and is referenced in the two standard atlases of Cameroon's languages: the Atlas Linguistique du Cameroun (Breton and Bikia Fohtung 1991) and the Atlas Linguistique de l'Afrique Centrale: le Cameroun (Bibam Bikoi 2012). Northwest Gbaya belongs to the Gbaya branch of the Atlantic-Congo sub-family of the Niger-Congo language family. Northwest Gbaya speakers are located primarily across a broad expanse of Cameroon and the Central African Republic, with a smaller community in Congo. The principal variety is Gbaya-Kara (Kàrà); Gbaya-Lay (Làì) is restricted to a small area north of Mbodomo, Cameroon. The total number of Northwest Gbaya speakers is estimated at approximately 65,000 in Cameroon (1980 figures) and 200,000 in the Central African Republic (1996 figures).
At the time of publication of this dataset, we do not have a precise idea of the full scope of variation of Gbaya-Lay, a variety which is itself considered a component of the Northwest Gbaya dialect continuum. The relationship between Gbaya-Lay and other attested varieties of Northwest Gbaya (e.g. Gbaya-Kara, and an unnamed third variety situated between Gbaya-Lay and Toongo) has not been systematically characterised in the available literature.
The writing system used for the transcription of Gbaya-Lay in this dataset is the International Phonetic Alphabet (IPA), as reflected in lexical entries (Word) and sentence-level examples (LangEx) in the datasheet. This is consistent with the transcription practice adopted by Paulette Roulon-Doko in her reference works on Northwest Gbaya. The phonological inventory described below is derived directly from the attested forms in the LangEx and Word columns of the datasheet.
An alternative orthography based on the General Alphabet of Cameroonian Languages (GACL) exists for Northwest Gbaya and is used notably in the Bible translation published by the Alliance biblique du Cameroun; however, the present dataset uses IPA-based phonetic transcription throughout.
The oral vowel system attested in the dataset is as follows:
i, e, ɛ, a, ɔ, o, u
Long vowels are also attested (e.g. gbaa, zɔɔ, bɛɛ, bɔɔ, hii). Nasalised vowels are attested in a restricted set of lexical items (e.g. lɛ̃́).
The consonant inventory reflected in the dataset includes the following simple, prenasalised, labial-velar and other consonants:
b, ɓ, d, ɗ, f, g, gb, h, k, kp, l, m, mb, n, nd, ng, ngb, ɲ, ŋ, p, r, R, s, t, v, w, y, z
The uppercase R in the dataset represents a retroflex or uvular rhotic, distinct from the alveolar tap/trill r, consistent with phonological distinctions described for Northwest Gbaya. Implosives ɓ and ɗ, prenasalised stops (mb, nd, ng, ngb) and labial-velar stops (gb, kp) are characteristic features of the Northwest Gbaya consonant system.
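Because several consonants in the inventory above are written with two or three letters (gb, kp, mb, nd, ng, ngb), a character-by-character split of a transcription would break them apart. The following is a minimal sketch of a segmenter, assuming a greedy longest-match strategy; that strategy is an assumption made for illustration, not a rule stated by the dataset, and it may mis-segment letter sequences that straddle a morpheme or syllable boundary. The function name `segment` is hypothetical.

```python
import unicodedata

# Consonant symbols copied from the inventory listed above.
CONSONANTS = [
    "b", "ɓ", "d", "ɗ", "f", "g", "gb", "h", "k", "kp", "l", "m", "mb",
    "n", "nd", "ng", "ngb", "ɲ", "ŋ", "p", "r", "R", "s", "t", "v", "w",
    "y", "z",
]
# Try longer symbols first so that "ngb" wins over "ng" and "n".
BY_LENGTH = sorted(CONSONANTS, key=len, reverse=True)


def segment(word):
    """Split a transcribed word into segments (assumed greedy longest match)."""
    # NFD separates tone and nasalisation marks from their base letters.
    text = unicodedata.normalize("NFD", word)
    segments = []
    i = 0
    while i < len(text):
        # Attach combining marks to the segment they follow.
        if unicodedata.combining(text[i]) and segments:
            segments[-1] += text[i]
            i += 1
            continue
        for consonant in BY_LENGTH:
            if text.startswith(consonant, i):
                segments.append(consonant)
                i += len(consonant)
                break
        else:
            # Anything not in the inventory (e.g. a vowel) is its own segment.
            segments.append(text[i])
            i += 1
    return segments


print(segment("gbaa"))
print(segment("léɓé"))
```

Segments are returned in decomposed (NFD) form, so a tone-marked vowel appears as the base letter followed by its combining mark.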
The datasheet shows lexically and grammatically contrastive tones, marked directly on vowels. The following tonal categories are attested in the LangEx column:
High tone (H): á, é, ɛ́, í, ó, ɔ́, ú, ń, ḿ
Low tone (L): à, è, ɛ̀, ì, ò, ɔ̀, ù
Falling contour tone (HL): â, ê, î, ô, ɔ̂, û, ɛ̂
Mid/level tone: marked with a macron on a restricted set of items (e.g. lɔ̄h, zāŋ, kpāā, bɔ̄ɔ̄ŋ)
Unmarked vowels represent tonally neutral or contextually determined syllables.
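Since tone is written with diacritics, the tonal categories above can be recovered programmatically by decomposing each letter into a base character and its combining marks. The sketch below assumes that the marks are encoded as the standard Unicode combining acute, grave, circumflex, macron and tilde (whether precomposed or not); the function name `tone_marks` and the labels H, L, HL and M are chosen for illustration.

```python
import unicodedata

# Combining diacritics mapped to the tone labels listed above.
TONE_MARKS = {
    "\u0301": "H",   # combining acute accent
    "\u0300": "L",   # combining grave accent
    "\u0302": "HL",  # combining circumflex accent
    "\u0304": "M",   # combining macron
}
NASAL_MARK = "\u0303"  # combining tilde (nasalisation)


def tone_marks(text):
    """Return (base letter, tone label, nasalised) for each tone-marked letter."""
    marked = []
    base, tone, nasal = None, None, False
    # NFD splits precomposed letters such as "á" into base + combining mark.
    # The trailing space is a sentinel that flushes the final letter.
    for ch in unicodedata.normalize("NFD", text) + " ":
        if unicodedata.combining(ch):
            if ch in TONE_MARKS:
                tone = TONE_MARKS[ch]
            elif ch == NASAL_MARK:
                nasal = True
            continue
        if base is not None and tone is not None:
            marked.append((base, tone, nasal))
        base, tone, nasal = ch, None, False
    return marked


for word in ["nú", "dɔ̂r", "kpāā", "zu"]:
    print(word, tone_marks(word))
```

Letters without a tone mark are skipped rather than labelled, in line with the statement that unmarked vowels are tonally neutral or contextually determined.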
The dataset was collected through a linguistic questionnaire designed to elicit basic lexical and grammatical information about Gbaya-Lay. This was done as part of the Atlas Linguistique du Cameroun (ALCAM) project.
Total size is approximately 12 MB (uncompressed), comprising approximately 11.7 MB of MP3 audio files and approximately 52 KB of TSV data files.
The total duration of the 262 audio recordings across three recording sessions is 3351 seconds (55 minutes 51 seconds).
The dataset comprises: 1) a datasheet (Gbaya-Lay_ALCAM-MultimodalDataset.tsv) with 374 lines and 20 columns; 2) 262 voice clips read by a single native speaker, stored as MP3 files across three tts_dataset subfolders, each with its own sentence-to-audio mapping file (mapping.tsv).
gbaya_yay_tts_dataset01_94clips_1046s_20260526-1741: 94 clips, 1046 seconds
gbaya_lay_tts_dataset02_88clips_950s_20260526-2006: 88 clips, 950 seconds
gbaya_lay_tts_dataset03_80clips_1355s_20260526-2247: 80 clips, 1355 seconds
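The subfolder names encode the clip count and duration of each session, so the totals quoted above (262 clips, 3351 seconds) can be checked directly from the names. The sketch below copies the three names verbatim from the list above and assumes the `_<N>clips_<S>s_` pattern holds for each of them; the function name `summarise` is hypothetical.

```python
import re

# Subfolder names copied verbatim from the list above.
FOLDERS = [
    "gbaya_yay_tts_dataset01_94clips_1046s_20260526-1741",
    "gbaya_lay_tts_dataset02_88clips_950s_20260526-2006",
    "gbaya_lay_tts_dataset03_80clips_1355s_20260526-2247",
]

# Assumed naming pattern: ..._<clips>clips_<seconds>s_<timestamp>
PATTERN = re.compile(r"_(\d+)clips_(\d+)s_")


def summarise(folders):
    """Return (total clips, total seconds) parsed from the folder names."""
    clips = seconds = 0
    for name in folders:
        match = PATTERN.search(name)
        if match is None:
            raise ValueError(f"unexpected folder name: {name}")
        clips += int(match.group(1))
        seconds += int(match.group(2))
    return clips, seconds


clips, seconds = summarise(FOLDERS)
minutes, remainder = divmod(seconds, 60)
print(f"{clips} clips, {seconds} s = {minutes} min {remainder} s")
```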
Columns of the datasheet (Gbaya-Lay_ALCAM-MultimodalDataset.tsv):
#OrigID: original number of lexical entry on paper questionnaire
#EditID: modification of #OrigID
#FrenchRef: reference entry (originally provided in French)
#FrenchComm: original comments about reference entry (#FrenchRef)
#French: lexical entry in French (overlaps with #FrenchRef)
#Note: note of researcher on the lexical entry
#POS: part of speech
#Class: noun class (where applicable)
#Morf: morphological attribute (e.g. plural, singular)
#Var: (na)
#Word: lexical entry in Gbaya-Lay
#CrossRef: cross-referencing of lexical entry number
#FrenchEx: example sentence in French
#LangEx: example sentence in Gbaya-Lay
#LangExEdit: manual editing of #LangEx
#FrenchExEdit: edited French equivalent of #FrenchEx
#LangPars: word-for-word parsing in Gbaya-Lay
#LangParsEdit: editing of #LangPars
#FrenchPars: French equivalent of #LangParsEdit
#FrenchParsEdit: editing of #FrenchPars
Columns of each mapping file (mapping.tsv):
#audio_filename: name of the MP3 audio file
#key: MD5-based identifier shared with the audio filename (without extension)
#sentence: lexical item and/or example sentence read by the speaker
#attempts: number of recording attempts before the selected take
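A mapping file with the four columns above can be read with Python's standard `csv` module. The sketch below is illustrative only: it assumes that the file has a header row whose column names are `audio_filename`, `key`, `sentence` and `attempts` (without the leading `#`), and it uses an in-memory sample row whose `attempts` value is invented. The function name `read_mapping` is hypothetical.

```python
import csv
import io
from pathlib import PurePosixPath


def read_mapping(handle):
    """Read mapping rows and check that each key matches its audio filename."""
    # QUOTE_NONE keeps apostrophes and quotes in the transcriptions literal.
    reader = csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    rows = []
    for row in reader:
        # The key is described as the audio filename without its extension.
        stem = PurePosixPath(row["audio_filename"]).stem
        if stem != row["key"]:
            raise ValueError(f"key does not match filename: {row['key']}")
        row["attempts"] = int(row["attempts"])
        rows.append(row)
    return rows


# Hypothetical sample: the header names and the attempts value are assumptions.
SAMPLE = (
    "audio_filename\tkey\tsentence\tattempts\n"
    "a7dbfc164ff740efa6ad37fe5ff3152f.mp3\t"
    "a7dbfc164ff740efa6ad37fe5ff3152f\to nú\t1\n"
)

rows = read_mapping(io.StringIO(SAMPLE))
print(rows[0]["audio_filename"], rows[0]["sentence"])
```

With a real file, the same function could be called on `open(path, encoding="utf-8", newline="")`, provided the header assumption holds.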
| audio file | words & sentences |
|---|---|
| c169b22c4e0739448098b7704b2f7b55.mp3 | nú ; a aá nɛ bé nú |
| a7dbfc164ff740efa6ad37fe5ff3152f.mp3 | o nú |
| cac0544662ad035f626abf524d700c40.mp3 | gbaa Rí ; wa duŋa Ríŋ nɛ́ Rí-wa |
| 696b454d98300fb0c2c6a22c9309eb43.mp3 | zu ; a aá nɛ́ gásá zu ín bɔ́ŋɛ́ gɛ́i |
| 5481873f9a14e45479342cbff3791a40.mp3 | búmá-tɛ ; búmá lɛ̃́-a nɛ lúa |
| 9d78d54578f502ee0040dda649b0f361.mp3 | léɓé ; a ɲɔŋa léɓé-a |
| 18c2c3c9c8ed67252f172353a3a3a355.mp3 | zɔɔ ; a nɔ́n nɛ́ zɔ-áa |
| 910aba6f614ee5a6c8977a6d2bf243c3.mp3 | gɛr(-a) ; gɛ́r-a haá haá |
| 6bcc88c7a6f6e96ff6cb1c8c266a651b.mp3 | kú ; a gusa kú-a |
| a68294425e3280901420c61ed5d68156.mp3 | dɔ̂r ; ḿ' kpāā dɔ̂r fɔrɔ |
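
In the sample rows above, most cells contain two parts separated by a semicolon, while one cell (`o nú`) contains a single part. A minimal sketch for splitting such a cell follows; it assumes that ` ; ` is the separator throughout and makes no claim about whether a lone part is a lexical item or a sentence. The function name `split_entry` is hypothetical.

```python
def split_entry(cell):
    """Split a 'words & sentences' cell into (first part, second part or None)."""
    first, separator, second = cell.partition(";")
    if not separator:
        # No semicolon: the cell holds a single item or sentence.
        return cell.strip(), None
    return first.strip(), second.strip()


print(split_entry("kú ; a gusa kú-a"))
print(split_entry("o nú"))
```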