Yoruba-English Code-Switching (YECS) Corpus
License:
NOODL-1.0
Steward:
LyngualLabs
Task: ASR
Release Date: 4/15/2026
Format: WAV, CSV
Size: 9.71 GB
Description
The Yoruba-English Code-Switching (YECS) Corpus is a comprehensive, ~120-hour dataset designed to capture the natural linguistic phenomenon of intra-sentential code-mixing. Curated by the LynguaTech Innovative Foundation (LyngualLabs), this dataset provides nearly 100,000 validated audio-text pairs recorded by 140 demographically diverse bilingual speakers in Nigeria. It features clean speech recordings paired with full Yoruba orthography (including verified tonal marks and diacritics), word-level language identification tags, and rich metadata spanning 16 semantic domains and 7 emotion categories. The dataset is explicitly partitioned to prevent data contamination, serving as a highly stratified, robust benchmark for low-resource speech technologies.
Specifics
Licensing
Nwulite Obodo Open Data Licence 1.0 (NOODL-1.0)
https://licensingafricandatasets.com/nwulite-obodo-license
Considerations
Restrictions/Special Constraints
This dataset is published under the Nwulite Obodo Open Data Licence 1.0 (NOODL-1.0). It is openly available and royalty-free for developers, researchers, and startups operating within developing nations to build, modify, and commercialize AI systems. However, commercial entities headquartered in developed, high-income nations must engage in a benefit-sharing agreement with LyngualLabs as stipulated by the NOODL-1.0 framework.
Forbidden Usage
You agree not to attempt to determine the identity of any speakers or contributors in this dataset. Any attempt to clone voices or train generative AI models (e.g., deepfakes, voice replication) that specifically imitate the individual speakers in this dataset is strictly forbidden. It is forbidden to use this dataset as the sole training source for high-stakes, safety-critical systems (e.g., medical diagnoses, legal transcription, or emergency routing systems) where zero-error tolerance is required. The dataset does not cover trilingual mixing (e.g., Nigerian Pidgin) and should not be misconstrued as a general West African multilingual benchmark.
Processes
Ethical Review
This dataset was curated using an ethical "data farming" framework prioritizing community reciprocity. Contributors were actively recruited from local Nigerian communities, fairly compensated for their time and expertise, and provided with capacity-building training. A strict daily recording cap was enforced to protect speaker vocal well-being. Prior to participation, all contributors provided explicit, informed consent for their data to be collected, used, and hosted publicly on the Mozilla Data Collective for global research. Before release, the dataset underwent a rigorous "PII Scrub" to permanently remove all personally identifiable information, and a final transparency notice was issued to contributors to ensure compliance with the Nigerian Data Protection Act (NDPA 2023).
Intended Use
This dataset is intended for training and evaluating robust Automatic Speech Recognition (ASR) and Text-to-Speech (TTS) systems specifically designed for low-resource, code-switched languages. Furthermore, it serves as a benchmark for downstream NLP tasks such as word-level Language Identification (LID) and Speech Emotion Recognition (SER), and provides a rich resource for academic linguistic research analyzing high-density code-switching and complex tonal/prosodic interactions.
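For the word-level Language Identification (LID) use case above, evaluation typically reduces to per-word tag accuracy against the corpus's gold tags. A minimal sketch, assuming pre-aligned, equal-length tag sequences and illustrative tag values ("yo"/"en" — the release's actual tag vocabulary should be checked against its documentation):

```python
def lid_accuracy(ref_tags, hyp_tags):
    """Word-level LID accuracy over two aligned tag sequences.

    Assumes the sequences are already aligned word-for-word; handling
    insertions/deletions from an ASR front-end would require an
    alignment step first.
    """
    if len(ref_tags) != len(hyp_tags):
        raise ValueError("tag sequences must be aligned to the same length")
    correct = sum(r == h for r, h in zip(ref_tags, hyp_tags))
    return correct / len(ref_tags)

# Example: two of three words tagged correctly.
score = lid_accuracy(["yo", "en", "yo"], ["yo", "en", "en"])
```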
Metadata
DATASET OVERVIEW & DEMOGRAPHICS
The YECS Corpus provides ~120 hours of high-quality, naturally produced intra-sentential Yoruba-English code-switched speech.
Total Size: 99,930 validated audio-text pairs.
Data Splits: The dataset is partitioned with strict prompt-level disjointness into Training (~95.5 hours), Validation (~12 hours), and Testing (~11.8 hours). To prevent data contamination, the Test set is released as audio-only; ground-truth annotations are retained internally by LyngualLabs for benchmarking.
Item Metadata: Each utterance includes clean speech recordings (mean SNR of 91.6 dB), full Yoruba orthography, word-level language tags, emotion labels, and domain tags.
Speaker Demographics: Features 140 unique speakers (95 female / 67.9%, 45 male / 32.1%). No single speaker accounts for more than 5% of the total dataset duration.
Language Dominance Breakdown: English-led (49.4%), Yoruba-led (38.6%), Balanced (11.9%), Monolingual (0.01%).
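The prompt-level disjointness of the splits described above can be sanity-checked from the per-split metadata. A minimal sketch, assuming each split ships a CSV with a `prompt_id` column (file and column names are assumptions, not the release's documented schema):

```python
import csv

def load_prompt_ids(path):
    """Read the set of prompt IDs from one split's metadata CSV.

    The 'prompt_id' column name is an assumption; adapt it to the
    actual schema of the released CSV files.
    """
    with open(path, newline="", encoding="utf-8") as f:
        return {row["prompt_id"] for row in csv.DictReader(f)}

def check_disjoint(train_csv, val_csv):
    """Assert no prompt appears in both splits; return the split sizes."""
    train_ids = load_prompt_ids(train_csv)
    val_ids = load_prompt_ids(val_csv)
    overlap = train_ids & val_ids
    assert not overlap, f"{len(overlap)} prompts leak between splits"
    return len(train_ids), len(val_ids)
```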
METADATA DISTRIBUTIONS
To ensure robust benchmarking, the dataset is highly stratified across all splits.
1. Gender Distribution
| Gender | Overall | Train | Val | Test |
|---|---|---|---|---|
| Female | 94.20h (78.9%) | 75.42h | 9.47h | 9.31h |
| Male | 25.17h (21.1%) | 20.15h | 2.50h | 2.52h |
2. Emotion Category Distribution
| Emotion | Overall | Train | Val | Test |
|---|---|---|---|---|
| Neutral | 115.37h (96.7%) | 92.35h | 11.56h | 11.46h |
| Happy | 1.32h (1.1%) | 1.09h | 0.12h | 0.11h |
| Sad | 1.19h (1.0%) | 0.94h | 0.13h | 0.12h |
| Angry | 0.72h (0.6%) | 0.59h | 0.07h | 0.07h |
| Disgusted | 0.38h (0.3%) | 0.30h | 0.04h | 0.04h |
| Surprised | 0.21h (0.2%) | 0.17h | 0.03h | 0.01h |
| Fearful | 0.18h (0.2%) | 0.15h | 0.02h | 0.02h |
3. Semantic Domain Distribution
| Domain | Overall | Train | Val | Test |
|---|---|---|---|---|
| General | 19.56h (16.4%) | 15.51h | 2.02h | 2.03h |
| Finance | 13.66h (11.4%) | 11.13h | 1.24h | 1.28h |
| Entertainment | 11.09h (9.3%) | 8.86h | 1.19h | 1.03h |
| News | 8.65h (7.2%) | 6.97h | 0.89h | 0.79h |
| History | 8.48h (7.1%) | 6.82h | 0.84h | 0.82h |
| Education | 8.09h (6.8%) | 6.36h | 0.87h | 0.85h |
| Fashion | 7.42h (6.2%) | 6.06h | 0.67h | 0.70h |
| Law | 6.87h (5.8%) | 5.42h | 0.71h | 0.74h |
| Religion | 6.67h (5.6%) | 5.44h | 0.61h | 0.61h |
| Sports | 6.37h (5.3%) | 4.99h | 0.70h | 0.67h |
| Transportation | 6.02h (5.0%) | 4.79h | 0.63h | 0.59h |
| Health | 5.89h (4.9%) | 4.70h | 0.58h | 0.62h |
| Science | 5.76h (4.8%) | 4.66h | 0.55h | 0.55h |
| Family | 1.91h (1.6%) | 1.51h | 0.18h | 0.22h |
| Agriculture | 1.55h (1.3%) | 1.24h | 0.16h | 0.15h |
| Tourism | 1.40h (1.2%) | 1.11h | 0.12h | 0.17h |
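Per-category hour totals like those in the tables above can be reproduced from utterance-level metadata by summing durations per label. A minimal sketch, assuming each metadata row carries a label column (e.g. `domain` or `emotion`) and a duration in seconds (`duration_s` — both column names are assumptions):

```python
from collections import defaultdict

def hours_by_label(rows, label_key, dur_key="duration_s"):
    """Sum per-utterance durations (seconds) by label, returned in hours.

    `rows` is an iterable of dicts, e.g. from csv.DictReader; the key
    names are assumptions about the metadata schema.
    """
    totals = defaultdict(float)
    for row in rows:
        totals[row[label_key]] += float(row[dur_key])
    return {label: secs / 3600.0 for label, secs in totals.items()}
```

Running this separately on the Train, Val, and Test metadata lets you verify the stratification claims, e.g. that each domain's share of hours is similar across splits.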
DATA COLLECTION & QUALITY PREPROCESSING
The dataset was curated using an ethical "data farming" framework designed to empower the local community. First, 50 trained bilingual writers generated 51,532 culturally grounded prompts. Linguistic experts validated these for tonal accuracy and applied language/emotion tags. Finally, 140 community speakers recorded the prompts via a device-agnostic web app.
Each prompt was recorded by three distinct speakers to ensure acoustic diversity. A mandatory record-submit-review workflow minimized clipping and background noise. After strict quality assurance for intelligibility and tonal preservation, non-compliant samples were excluded from the raw collection, yielding the final ~120-hour corpus.
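The clipping checks in the record-submit-review workflow can be approximated with a simple peak-amplitude test on the released WAV files. A minimal sketch using only the standard library, assuming 16-bit PCM audio (the corpus's exact encoding should be confirmed from the release notes); this is an illustrative proxy, not LyngualLabs' actual QA pipeline:

```python
import wave
import array

def peak_ratio(path):
    """Peak absolute amplitude as a fraction of full scale for a
    16-bit PCM WAV file — a cheap proxy for clipping."""
    with wave.open(path, "rb") as wf:
        assert wf.getsampwidth() == 2, "sketch assumes 16-bit PCM"
        samples = array.array("h")
        samples.frombytes(wf.readframes(wf.getnframes()))
    return max(abs(s) for s in samples) / 32768.0

def looks_clipped(path, threshold=0.999):
    """Flag files whose peak sits at (or essentially at) full scale."""
    return peak_ratio(path) >= threshold
```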