Yoruba-English Code-Switching (YECS) Corpus


License: NOODL-1.0


Steward: LyngualLabs

Task: ASR

Release Date: 4/15/2026

Format: WAV, CSV

Size: 9.71 GB



Description

The Yoruba-English Code-Switching (YECS) Corpus is a comprehensive, ~120-hour dataset designed to capture the natural linguistic phenomenon of intra-sentential code-mixing. Curated by the LynguaTech Innovative Foundation (LyngualLabs), this dataset provides nearly 100,000 validated audio-text pairs recorded by 140 demographically diverse bilingual speakers in Nigeria. It features clean speech recordings paired with full Yoruba orthography (including verified tonal marks and diacritics), word-level language identification tags, and rich metadata spanning 16 semantic domains and 7 emotion categories. The dataset is explicitly partitioned to prevent data contamination, serving as a highly stratified, robust benchmark for low-resource speech technologies.
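Since the corpus ships as WAV audio plus CSV metadata, a manifest loader is a natural starting point. The sketch below assumes a hypothetical column layout (`audio_path`, `transcript`, `word_lang_tags`, `emotion`, `domain`, `speaker_id`) and a space-separated tag string; the actual YECS schema may name or order fields differently.

```python
import csv
import io

# Hypothetical manifest row layout -- the real YECS CSV schema may differ.
SAMPLE_MANIFEST = io.StringIO(
    "audio_path,transcript,word_lang_tags,emotion,domain,speaker_id\n"
    "clips/yecs_000001.wav,Mo fẹ́ buy the book,yor yor eng eng eng,neutral,general,spk_007\n"
)

def load_manifest(fh):
    """Parse a YECS-style CSV manifest into a list of utterance records."""
    records = []
    for row in csv.DictReader(fh):
        # Split the word-level language ID tags into one tag per word.
        row["word_lang_tags"] = row["word_lang_tags"].split()
        records.append(row)
    return records

records = load_manifest(SAMPLE_MANIFEST)
print(records[0]["transcript"])      # Mo fẹ́ buy the book
print(records[0]["word_lang_tags"])  # ['yor', 'yor', 'eng', 'eng', 'eng']
```

In practice the same loader would take an open file handle for the released CSV instead of the inline sample.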

Specifics

Licensing

Nwulite Obodo Open Data Licence 1.0 (NOODL-1.0)

https://licensingafricandatasets.com/nwulite-obodo-license

Considerations

Restrictions/Special Constraints

This dataset is published under the Nwulite Obodo Open Data Licence 1.0 (NOODL-1.0). It is openly available and royalty-free for developers, researchers, and startups operating within developing nations to build, modify, and commercialize AI systems. However, commercial entities headquartered in developed, high-income nations must engage in a benefit-sharing agreement with LyngualLabs as stipulated by the NOODL-1.0 framework.

Forbidden Usage

You agree not to attempt to determine the identity of any speakers or contributors in this dataset. Any attempt to clone voices or train generative AI models (e.g., deepfakes, voice replication) that specifically imitate the individual speakers in this dataset is strictly forbidden. It is forbidden to use this dataset as the sole training source for high-stakes, safety-critical systems (e.g., medical diagnoses, legal transcription, or emergency routing systems) where zero-error tolerance is required. The dataset does not cover trilingual mixing (e.g., Nigerian Pidgin) and should not be misconstrued as a general West African multilingual benchmark.

Processes

Ethical Review

This dataset was curated using an ethical "data farming" framework prioritizing community reciprocity. Contributors were actively recruited from local Nigerian communities, fairly compensated for their time and expertise, and provided with capacity-building training. A strict daily recording cap was enforced to protect speaker vocal well-being. Prior to participation, all contributors provided explicit, informed consent for their data to be collected, used, and hosted publicly on the Mozilla Data Collective for global research. Before release, the dataset underwent a rigorous "PII Scrub" to permanently remove all personally identifiable information, and a final transparency notice was issued to contributors to ensure compliance with the Nigerian Data Protection Act (NDPA 2023).

Intended Use

This dataset is intended for training and evaluating robust Automatic Speech Recognition (ASR) and Text-to-Speech (TTS) systems specifically designed for low-resource, code-switched languages. Furthermore, it serves as a benchmark for downstream NLP tasks such as word-level Language Identification (LID) and Speech Emotion Recognition (SER), and provides a rich resource for academic linguistic research analyzing high-density code-switching and complex tonal/prosodic interactions.
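For the word-level LID benchmark mentioned above, the simplest metric is per-token tag accuracy over aligned reference/hypothesis sequences. This is a minimal sketch with made-up tag sequences, not the official evaluation protocol.

```python
def lid_accuracy(ref_tags, hyp_tags):
    """Word-level language ID accuracy over pre-aligned tag sequences."""
    assert len(ref_tags) == len(hyp_tags), "sequences must be pre-aligned"
    correct = sum(r == h for r, h in zip(ref_tags, hyp_tags))
    return correct / len(ref_tags)

# Toy example: one Yoruba word mis-tagged as English.
ref = ["yor", "yor", "eng", "eng", "yor"]
hyp = ["yor", "eng", "eng", "eng", "yor"]
print(f"{lid_accuracy(ref, hyp):.2f}")  # 0.80
```

A real ASR+LID pipeline would first need word-level alignment between hypothesis and reference before tags can be compared.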

Metadata

DATASET OVERVIEW & DEMOGRAPHICS

The YECS Corpus provides ~120 hours of high-quality, naturally produced intra-sentential Yoruba-English code-switched speech.

  • Total Count: 99,930 validated audio-text pairs.

  • Data Splits: The dataset is partitioned with strict prompt-level disjointness into Training (~95.5 hours), Validation (~12 hours), and Testing (~11.8 hours). To prevent data contamination, the Test set is released as audio-only; ground-truth annotations are retained internally by LyngualLabs for benchmarking.

  • Item Metadata: Each utterance includes clean speech recordings (mean SNR of 91.6 dB), full Yoruba orthography, word-level language tags, emotion labels, and domain tags.

  • Speaker Demographics: Features 140 unique speakers (95 female / 67.9%, 45 male / 32.1%). No single speaker accounts for more than 5% of the total dataset duration.

  • Language Dominance Breakdown: English-led (49.4%), Yoruba-led (38.6%), Balanced (11.9%), Monolingual (0.01%).
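The prompt-level disjointness guarantee noted above is easy to verify mechanically. The sketch below assumes each split exposes a set of prompt IDs (the ID field name is hypothetical); it returns every ID that leaks across splits.

```python
def check_prompt_disjointness(split_prompts):
    """Return the set of prompt IDs that appear in more than one split."""
    seen = {}       # prompt ID -> first split it was seen in
    overlaps = set()
    for split, prompts in split_prompts.items():
        for pid in prompts:
            if pid in seen and seen[pid] != split:
                overlaps.add(pid)
            seen[pid] = split
    return overlaps

# Toy splits with hypothetical prompt IDs; an empty result means disjoint.
splits = {
    "train": {"p001", "p002"},
    "val": {"p003"},
    "test": {"p004"},
}
print(check_prompt_disjointness(splits))  # set()
```

Running the same check over the released train/validation manifests (the test set ships audio-only) would confirm no prompt is shared between them.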

METADATA DISTRIBUTIONS

To ensure robust benchmarking, the dataset is highly stratified across all splits.

1. Gender Distribution

Gender    Overall           Train     Val      Test
Female    94.20h (78.9%)    75.42h    9.47h    9.31h
Male      25.17h (21.1%)    20.15h    2.50h    2.52h

2. Emotion Category Distribution

Emotion      Overall            Train     Val       Test
Neutral      115.37h (96.7%)    92.35h    11.56h    11.46h
Happy        1.32h (1.1%)       1.09h     0.12h     0.11h
Sad          1.19h (1.0%)       0.94h     0.13h     0.12h
Angry        0.72h (0.6%)       0.59h     0.07h     0.07h
Disgusted    0.38h (0.3%)       0.30h     0.04h     0.04h
Surprised    0.21h (0.2%)       0.17h     0.03h     0.01h
Fearful      0.18h (0.2%)       0.15h     0.02h     0.02h

3. Semantic Domain Distribution

Domain            Overall           Train     Val      Test
General           19.56h (16.4%)    15.51h    2.02h    2.03h
Finance           13.66h (11.4%)    11.13h    1.24h    1.28h
Entertainment     11.09h (9.3%)     8.86h     1.19h    1.03h
News              8.65h (7.2%)      6.97h     0.89h    0.79h
History           8.48h (7.1%)      6.82h     0.84h    0.82h
Education         8.09h (6.8%)      6.36h     0.87h    0.85h
Fashion           7.42h (6.2%)      6.06h     0.67h    0.70h
Law               6.87h (5.8%)      5.42h     0.71h    0.74h
Religion          6.67h (5.6%)      5.44h     0.61h    0.61h
Sports            6.37h (5.3%)      4.99h     0.70h    0.67h
Transportation    6.02h (5.0%)      4.79h     0.63h    0.59h
Health            5.89h (4.9%)      4.70h     0.58h    0.62h
Science           5.76h (4.8%)      4.66h     0.55h    0.55h
Family            1.91h (1.6%)      1.51h     0.18h    0.22h
Agriculture       1.55h (1.3%)      1.24h     0.16h    0.15h
Tourism           1.40h (1.2%)      1.11h     0.12h    0.17h
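The stratification claim can be spot-checked by comparing a domain's share of each split. This sketch uses the Finance row from the table and the approximate split totals quoted earlier (~95.5h train, ~11.8h test); a well-stratified split keeps these shares close.

```python
def domain_share(domain_hours, split_total_hours):
    """Fraction of a split's total duration occupied by one domain."""
    return domain_hours / split_total_hours

# Finance: 11.13h of ~95.5h train, 1.28h of ~11.8h test (values from the card).
train_share = domain_share(11.13, 95.5)
test_share = domain_share(1.28, 11.8)
print(abs(train_share - test_share) < 0.02)  # True: shares are close
```

The same comparison can be run over every domain and emotion row to quantify how tightly the splits mirror the overall distribution.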

DATA COLLECTION & QUALITY PREPROCESSING

The dataset was curated using an ethical "data farming" framework designed to empower the local community. First, 50 trained bilingual writers generated 51,532 culturally grounded prompts. Next, linguistic experts validated these for tonal accuracy and applied language/emotion tags. Finally, 140 community speakers recorded the prompts via a device-agnostic web app.

Each prompt was recorded by three distinct speakers to ensure acoustic diversity. A mandatory record-submit-review workflow minimized clipping and background noise. Following strict quality assurance for intelligibility and tonal preservation, non-compliant samples were excluded from the initial raw collection, resulting in this final ~120-hour corpus.
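One automatable piece of the record-submit-review workflow described above is a clipping check on the submitted 16-bit PCM audio. This is a minimal sketch over raw sample values (the actual QA tooling and thresholds are not documented in the card and are assumed here).

```python
def clipping_ratio(samples, limit=32767):
    """Fraction of 16-bit PCM samples at or beyond full scale (clipped)."""
    clipped = sum(1 for s in samples if abs(s) >= limit)
    return clipped / len(samples)

# Toy signals: a quiet take versus one driven into the rails.
clean = [1000, -2000, 3000, -400]
hot = [32767, -32768, 1200, 32767]
print(clipping_ratio(clean))  # 0.0
print(clipping_ratio(hot))    # 0.75
```

A review step could reject any submission whose clipping ratio exceeds a small threshold (say 0.1%), which is one plausible way "minimized clipping" could be enforced at scale.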