Gbaya-Lay_ALCAM-MultimodalDataset

Description

Gbaya-Lay_ALCAM-MultimodalDataset is a richly curated, multimodal linguistic dataset dedicated to the documentation and technological enhancement of Gbaya-Lay (Làì), a variety of Northwest Gbaya (ISO 639-3: gya), a Niger-Congo language spoken in Cameroon and the Central African Republic. Gbaya-Lay is a localised and socially embedded speech form that is rarely represented in standard grammatical descriptions or lexicographical resources. The dataset comprises three closely aligned components: (i) a structured datasheet containing carefully selected example sentences reflecting usage in Gbaya-Lay; (ii) high-quality audio recordings of these sentences, produced by a native speaker across three recording sessions; and (iii) explicit audio–sentence mapping files enabling precise alignment between the textual and acoustic data.

The dataset's primary added value lies in its explicit focus on the Gbaya-Lay variety of Northwest Gbaya. Gbaya-Lay (Làì) is classified as a dialect of Northwest Gbaya (gya) by the Ethnologue (https://www.ethnologue.com/language/gya/) and is referenced in the standard atlases of Cameroon's languages: the Atlas Linguistique du Cameroun by Breton and Bikia Fohtung (1991) and the Atlas Linguistique de l'Afrique Centrale: le Cameroun by Bibam Bikoi (2012). Gbaya-Lay is geographically restricted to a small area north of Mbodomo, in Cameroon, which distinguishes it from the more widely spoken Gbaya-Kara (Kàrà) variety. Like many geographically and socially situated language varieties, Gbaya-Lay typically remains invisible in reference grammars, dictionaries and educational materials, which often privilege better-documented or more standardised forms of the language. The dataset captures micro-variation in phonetics, phonology, morphosyntax and lexical choice, which is essential for understanding socially situated linguistic practices rather than a homogeneous, abstract system. In this sense, the dataset contributes to a more inclusive representation of linguistic diversity within the Northwest Gbaya speech community.

From a methodological perspective, the dataset is designed to bridge the gap between language documentation and language technology. The parallel availability of text in Gbaya-Lay and in French, alongside aligned speech, makes the dataset suitable for a wide range of applications, including automatic speech recognition (ASR), text-to-speech (TTS), machine translation (MT), forced alignment, pronunciation modelling and multimodal language learning tools. At the same time, the structured datasheet supports linguistic analysis, contrastive studies with other language varieties and pedagogical uses in teacher training and language revitalisation contexts. More broadly, the Gbaya-Lay_ALCAM-MultimodalDataset exemplifies an approach to African language resources that highlights fluidity, longitudinal variation, orality and community-based practice.

Language

Gbaya-Lay (Làì) is classified as a dialect of Northwest Gbaya (ISO 639-3: gya) by the Ethnologue (https://www.ethnologue.com/language/gya/) and is referenced in the two standard atlases of Cameroon's languages: the Atlas Linguistique du Cameroun (Breton and Bikia Fohtung 1991) and the Atlas Linguistique de l'Afrique Centrale: le Cameroun (Bibam Bikoi 2012). Northwest Gbaya belongs to the Gbaya branch of the Atlantic-Congo sub-family of the Niger-Congo language family. Northwest Gbaya speakers are located primarily across a broad expanse of Cameroon and the Central African Republic, with a smaller community in Congo. The principal variety is Gbaya-Kara (Kàrà); Gbaya-Lay (Làì) is restricted to a small area north of Mbodomo, Cameroon. The total number of Northwest Gbaya speakers is estimated at approximately 65,000 in Cameroon (1980 figures) and 200,000 in the Central African Republic (1996 figures).

Variants

At the time of publication of this dataset, we do not have a precise idea of the full scope of variation of Gbaya-Lay, a variety which is itself considered a component of the Northwest Gbaya dialect continuum. The relationship between Gbaya-Lay and other attested varieties of Northwest Gbaya (e.g. Gbaya-Kara, and an unnamed third variety situated between Gbaya-Lay and Toongo) has not been systematically characterised in the available literature.

Writing System

The writing system used for the transcription of Gbaya-Lay in this dataset is the International Phonetic Alphabet (IPA), as reflected in lexical entries (Word) and sentence-level examples (LangEx) in the datasheet. This is consistent with the transcription practice adopted by Paulette Roulon-Doko in her reference works on Northwest Gbaya. The phonological inventory described below is derived directly from the attested forms in the LangEx and Word columns of the datasheet.

An alternative orthography based on the General Alphabet of Cameroonian Languages (GACL) exists for Northwest Gbaya and is used notably in the Bible translation published by the Alliance biblique du Cameroun; however, the present dataset uses IPA-based phonetic transcription throughout.

1. Vowels

The oral vowel system attested in the dataset is as follows:

i, e, ɛ, a, ɔ, o, u

Long vowels are also attested (e.g. gbaa, zɔɔ, bɛɛ, bɔɔ, hii). Nasalised vowels are attested in a restricted set of lexical items (e.g. lɛ̃́).

2. Consonants

The consonant inventory reflected in the dataset includes the following simple, implosive, prenasalised, labial-velar and other consonants:

b, ɓ, d, ɗ, f, g, gb, h, k, kp, l, m, mb, n, nd, ng, ngb, ɲ, ŋ, p, r, R, s, t, v, w, y, z

The uppercase R in the dataset represents a retroflex or uvular rhotic, distinct from the alveolar tap/trill r, consistent with phonological distinctions described for Northwest Gbaya. Implosives ɓ and ɗ, prenasalised stops (mb, nd, ng, ngb) and labial-velar stops (gb, kp) are characteristic features of the Northwest Gbaya consonant system.
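
For pronunciation modelling or forced alignment, a transcription may need to be split into segments so that digraphs and trigraphs such as gb, kp and ngb are kept together. The following is a minimal sketch, assuming a greedy longest-match strategy over the vowel and consonant lists given in this section; the function name segment is hypothetical, and greedy matching may wrongly merge letters that meet across a morpheme or syllable boundary.

```python
import unicodedata

# Inventories copied from the Vowels and Consonants lists in this section.
VOWELS = ["i", "e", "ɛ", "a", "ɔ", "o", "u"]
CONSONANTS = [
    "b", "ɓ", "d", "ɗ", "f", "g", "gb", "h", "k", "kp", "l", "m", "mb",
    "n", "nd", "ng", "ngb", "ɲ", "ŋ", "p", "r", "R", "s", "t", "v", "w",
    "y", "z",
]

# Longest units first, so that "ngb" is tried before "ng" and "n".
UNITS = sorted(VOWELS + CONSONANTS, key=len, reverse=True)


def segment(word):
    """Split one transcribed word into segments (greedy longest match).

    Tone and nasalisation diacritics are attached to the preceding segment.
    Characters outside the inventory (hyphens, apostrophes) are kept as
    separate one-character segments.
    """
    text = unicodedata.normalize("NFD", word)
    segments = []
    i = 0
    while i < len(text):
        char = text[i]
        if unicodedata.combining(char) and segments:
            segments[-1] += char  # diacritic belongs to the previous segment
            i += 1
            continue
        for unit in UNITS:
            if text.startswith(unit, i):
                segments.append(unit)
                i += len(unit)
                break
        else:
            segments.append(char)  # not in the inventory: keep as-is
            i += 1
    return segments


print(segment("gbaa"))  # ['gb', 'a', 'a']
```

The output segments are in decomposed (NFD) form; they can be recomposed with unicodedata.normalize("NFC", ...) if precomposed characters are preferred.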

3. Tone system

The datasheet shows contrastive lexical and grammatical tones, marked directly on vowels and, for high tone, on syllabic nasals. The following tonal categories are attested in the LangEx column:

High tone (H): á, é, ɛ́, í, ó, ɔ́, ú, ń, ḿ
Low tone (L): à, è, ɛ̀, ì, ò, ɔ̀, ù
Falling contour tone (HL): â, ê, î, ô, ɔ̂, û, ɛ̂
Mid/level tone: marked with a macron on a restricted set of items (e.g. lɔ̄h, zāŋ, kpāā, bɔ̄ɔ̄ŋ)

Unmarked vowels represent tonally neutral or contextually determined syllables.
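
Because the tone marks are ordinary Unicode diacritics, the categories above can be recovered programmatically by decomposing the text (NFD) and looking for the combining accents. This is a minimal sketch under that assumption; the labels H, L, HL and M and the function names are illustrative choices, and unmarked syllables are simply not reported.

```python
import unicodedata
from collections import Counter

# Combining diacritics mapped to the tonal categories listed above.
TONE_MARKS = {
    "\u0301": "H",   # combining acute accent: high tone
    "\u0300": "L",   # combining grave accent: low tone
    "\u0302": "HL",  # combining circumflex accent: falling contour
    "\u0304": "M",   # combining macron: mid/level tone
}


def tone_sequence(text):
    """Return the tone labels marked in `text`, in order of appearance."""
    decomposed = unicodedata.normalize("NFD", text)
    return [TONE_MARKS[char] for char in decomposed if char in TONE_MARKS]


def tone_counts(text):
    """Count how often each marked tone category occurs in `text`."""
    return Counter(tone_sequence(text))


print(tone_sequence("búmá-tɛ"))  # ['H', 'H']
print(tone_sequence("dɔ̂r"))      # ['HL']
```

The nasalisation tilde (U+0303) is not in the table, so a form such as lɛ̃́ is reported as carrying a single high tone.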

Source

The dataset was collected through a questionnaire designed to gather basic information about the Gbaya-Lay lexicon and grammar. This was done as part of the Atlas Linguistique du Cameroun (ALCAM) project.

Domain

The dataset derives from a linguistic questionnaire designed to elicit basic lexical and grammatical information.

Size

Total size is approximately 12 MB (uncompressed), comprising approximately 11.7 MB of MP3 audio files and approximately 52 KB of TSV data files.

The total duration of the 262 audio recordings across three recording sessions is 3351 seconds (55 minutes 51 seconds).

Structure

The dataset comprises: 1) a datasheet (Gbaya-Lay_ALCAM-MultimodalDataset.tsv) with 374 lines and 20 columns; 2) 262 voice clips read by a single native speaker, stored as MP3 files across three tts_dataset subfolders, each with its own sentence-to-audio mapping file (mapping.tsv).

Recording sessions

gbaya_lay_tts_dataset01_94clips_1046s_20260526-1741: 94 clips, 1046 seconds
gbaya_lay_tts_dataset02_88clips_950s_20260526-2006: 88 clips, 950 seconds
gbaya_lay_tts_dataset03_80clips_1355s_20260526-2247: 80 clips, 1355 seconds
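
The clip counts and durations are encoded in the session folder names, so the totals given under Size can be cross-checked mechanically. The sketch below assumes the naming pattern _<N>clips_<S>s_ seen in the three names above; the function names are hypothetical, and only the part of each folder name from tts_dataset onward is used.

```python
import re

# Matches the "_<clips>clips_<seconds>s_" part of a session folder name.
SESSION_RE = re.compile(r"_(\d+)clips_(\d+)s_")


def parse_session(folder_name):
    """Return (clip_count, seconds) encoded in a session folder name."""
    match = SESSION_RE.search(folder_name)
    if match is None:
        raise ValueError(f"unrecognised session folder name: {folder_name!r}")
    return int(match.group(1)), int(match.group(2))


def summarise(folder_names):
    """Return (total clips, total seconds, duration as 'M min S s')."""
    parsed = [parse_session(name) for name in folder_names]
    total_clips = sum(clips for clips, _ in parsed)
    total_seconds = sum(seconds for _, seconds in parsed)
    minutes, seconds = divmod(total_seconds, 60)
    return total_clips, total_seconds, f"{minutes} min {seconds} s"


# Folder-name fragments, taken from the list of recording sessions.
SESSIONS = [
    "tts_dataset01_94clips_1046s_20260526-1741",
    "tts_dataset02_88clips_950s_20260526-2006",
    "tts_dataset03_80clips_1355s_20260526-2247",
]

print(summarise(SESSIONS))  # (262, 3351, '55 min 51 s')
```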

Description of columns (datasheet)

#OrigID: original number of lexical entry on paper questionnaire
#EditID: modification of #OrigID
#FrenchRef: reference entry (originally provided in French)
#FrenchComm: original comments about reference entry (#FrenchRef)
#French: lexical entry in French (overlaps with #FrenchRef)
#Note: note of researcher on the lexical entry
#POS: part of speech
#Class: noun class (where applicable)
#Morf: morphological attribute (e.g. plural, singular)
#Var: (na)
#Word: lexical entry in Gbaya-Lay
#CrossRef: cross-referencing of lexical entry number
#FrenchEx: example sentence in French
#LangEx: example sentence in Gbaya-Lay
#LangExEdit: manual editing of #LangEx
#FrenchExEdit: edited French equivalent of #FrenchEx
#LangPars: word-for-word parsing in Gbaya-Lay
#LangParsEdit: editing of #LangPars
#FrenchPars: French equivalent of #LangParsEdit
#FrenchParsEdit: editing of #FrenchPars
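
For machine translation or contrastive work, the datasheet can be reduced to Gbaya-Lay/French sentence pairs. The sketch below is illustrative only and rests on several assumptions not stated in this description: that the TSV file is UTF-8 with a header row, that header names may or may not carry the leading # used in the list above, and that a filled edited column (LangExEdit, FrenchExEdit) should be preferred over the unedited one. The function names are hypothetical.

```python
import csv


def read_tsv(path):
    """Read a tab-separated file into a list of dicts.

    A leading '#' on header names is dropped, so the column documented
    as '#LangEx' is available under the key 'LangEx'.
    """
    with open(path, encoding="utf-8", newline="") as handle:
        reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
        header = [name.lstrip("#").strip() for name in next(reader, [])]
        return [dict(zip(header, row)) for row in reader if row]


def first_filled(row, *columns):
    """Return the first non-blank value among `columns`, or ''."""
    for column in columns:
        value = (row.get(column) or "").strip()
        if value:
            return value
    return ""


def parallel_pairs(rows):
    """Yield (Gbaya-Lay sentence, French sentence) pairs.

    Assumption: the edited column is used when filled, the unedited
    column otherwise. Rows lacking either side are skipped.
    """
    for row in rows:
        lang = first_filled(row, "LangExEdit", "LangEx")
        french = first_filled(row, "FrenchExEdit", "FrenchEx")
        if lang and french:
            yield lang, french
```

A typical call would be list(parallel_pairs(read_tsv("Gbaya-Lay_ALCAM-MultimodalDataset.tsv"))), with the path adjusted to wherever the datasheet is stored.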

Description of columns (mapping file)

#audio_filename: name of the MP3 audio file
#key: MD5-based identifier shared with the audio filename (without extension)
#sentence: lexical item and/or example sentence read by the speaker
#attempts: number of recording attempts before the selected take
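
A mapping file can be loaded and checked for consistency with the audio clips as sketched below. The sketch assumes, without confirmation from this description, that mapping.tsv is UTF-8 with a header row, that header names may carry a leading #, and that the MP3 files sit in the same folder as the mapping file. It only verifies that key equals the filename without its extension; how the MD5-based identifier was computed is not specified here and is not reconstructed. The function names are hypothetical.

```python
import csv
from pathlib import Path


def load_mapping(mapping_path):
    """Read a mapping.tsv file into a list of dicts, one per clip.

    Each entry gains an 'audio_path' key pointing next to the mapping
    file (assumed location of the MP3 files).
    """
    mapping_path = Path(mapping_path)
    entries = []
    with mapping_path.open(encoding="utf-8", newline="") as handle:
        reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
        header = [name.lstrip("#").strip() for name in next(reader, [])]
        for row in reader:
            if not row:
                continue  # skip blank lines
            entry = dict(zip(header, row))
            entry["audio_path"] = mapping_path.parent / entry["audio_filename"]
            entries.append(entry)
    return entries


def check_mapping(entries):
    """Return a list of problems; an empty list means no issue was found."""
    problems = []
    for entry in entries:
        filename = entry["audio_filename"]
        if entry.get("key") != Path(filename).stem:
            problems.append(f"{filename}: key does not match filename")
        if not entry["audio_path"].is_file():
            problems.append(f"{filename}: audio file not found")
    return problems
```

Running check_mapping(load_mapping(...)) once per session subfolder gives a quick integrity check before the clips are used for ASR or TTS training.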

Sample

audio file	words & sentences
c169b22c4e0739448098b7704b2f7b55.mp3	nú ; a aá nɛ bé nú
a7dbfc164ff740efa6ad37fe5ff3152f.mp3	o nú
cac0544662ad035f626abf524d700c40.mp3	gbaa Rí ; wa duŋa Ríŋ nɛ́ Rí-wa
696b454d98300fb0c2c6a22c9309eb43.mp3	zu ; a aá nɛ́ gásá zu ín bɔ́ŋɛ́ gɛ́i
5481873f9a14e45479342cbff3791a40.mp3	búmá-tɛ ; búmá lɛ̃́-a nɛ lúa
9d78d54578f502ee0040dda649b0f361.mp3	léɓé ; a ɲɔŋa léɓé-a
18c2c3c9c8ed67252f172353a3a3a355.mp3	zɔɔ ; a nɔ́n nɛ́ zɔ-áa
910aba6f614ee5a6c8977a6d2bf243c3.mp3	gɛr(-a) ; gɛ́r-a haá haá
6bcc88c7a6f6e96ff6cb1c8c266a651b.mp3	kú ; a gusa kú-a
a68294425e3280901420c61ed5d68156.mp3	dɔ̂r ; ḿ' kpāā dɔ̂r fɔrɔ
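
In the sample, most entries pair a lexical item with an example sentence, separated by a semicolon, while some (e.g. o nú) consist of a single part. A minimal sketch for splitting such entries follows; that the semicolon is the separator is inferred from the sample rather than documented, and the function names are hypothetical.

```python
def split_entry(text):
    """Split a 'words & sentences' field on ';' into trimmed parts."""
    return [part.strip() for part in text.split(";") if part.strip()]


def parse_sample_line(line):
    """Split one tab-separated sample line into (audio file, parts)."""
    filename, _, text = line.rstrip("\n").partition("\t")
    return filename, split_entry(text)


print(split_entry("nú ; a aá nɛ bé nú"))  # ['nú', 'a aá nɛ bé nú']
print(split_entry("o nú"))                # ['o nú']
```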