Yoruba-English Code-Switching (YECS) Corpus


License: NOODL-1.0


Steward: LyngualLabs

Task: ASR

Release Date: 4/15/2026

Format: WAV, CSV

Size: 9.71 GB



Description

The Yoruba-English Code-Switching (YECS) Corpus is a comprehensive, ~120-hour dataset designed to capture the natural linguistic phenomenon of intra-sentential code-mixing. Curated by the LynguaTech Innovative Foundation (LyngualLabs), this dataset provides nearly 100,000 validated audio-text pairs recorded by 140 demographically diverse bilingual speakers in Nigeria. It features clean speech recordings paired with full Yoruba orthography (including verified tonal marks and diacritics), word-level language identification tags, and rich metadata spanning 16 semantic domains and 7 emotion categories. The dataset is explicitly partitioned to prevent data contamination, serving as a highly stratified, robust benchmark for low-resource speech technologies.
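Since the corpus ships as WAV audio plus CSV metadata, a manifest loader is a natural starting point. The sketch below assumes a hypothetical column layout (`audio_path`, `transcript`, `word_lang_tags`, `emotion`, `domain`, `speaker_id`) and a space-separated tag string; the actual YECS schema may name or order fields differently.

```python
import csv
import io

# Hypothetical manifest row layout -- the real YECS CSV schema may differ.
SAMPLE_MANIFEST = io.StringIO(
    "audio_path,transcript,word_lang_tags,emotion,domain,speaker_id\n"
    "clips/yecs_000001.wav,Mo fẹ́ buy the book,yor yor eng eng eng,neutral,general,spk_007\n"
)

def load_manifest(fh):
    """Parse a YECS-style CSV manifest into a list of utterance records."""
    records = []
    for row in csv.DictReader(fh):
        # Split the word-level language ID tags into one tag per word.
        row["word_lang_tags"] = row["word_lang_tags"].split()
        records.append(row)
    return records

records = load_manifest(SAMPLE_MANIFEST)
print(records[0]["transcript"])      # Mo fẹ́ buy the book
print(records[0]["word_lang_tags"])  # ['yor', 'yor', 'eng', 'eng', 'eng']
```

In practice the same loader would take an open file handle for the released CSV instead of the inline sample.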

Specifics

Licensing

Nwulite Obodo Open Data Licence 1.0 (NOODL-1.0)

https://licensingafricandatasets.com/nwulite-obodo-license

Considerations

Restrictions/Special Constraints

This dataset is published under the Nwulite Obodo Open Data Licence 1.0 (NOODL-1.0). It is openly available and royalty-free for developers, researchers, and startups operating within developing nations to build, modify, and commercialize AI systems. However, commercial entities headquartered in developed, high-income nations must engage in a benefit-sharing agreement with LyngualLabs as stipulated by the NOODL-1.0 framework.

Forbidden Usage

You agree not to attempt to determine the identity of any speakers or contributors in this dataset. Any attempt to clone voices or train generative AI models (e.g., deepfakes, voice replication) that specifically imitate the individual speakers in this dataset is strictly forbidden. It is forbidden to use this dataset as the sole training source for high-stakes, safety-critical systems (e.g., medical diagnoses, legal transcription, or emergency routing systems) where zero-error tolerance is required. The dataset does not cover trilingual mixing (e.g., Nigerian Pidgin) and should not be misconstrued as a general West African multilingual benchmark.

Processes

Ethical Review

This dataset was curated using an ethical "data farming" framework prioritizing community reciprocity. Contributors were actively recruited from local Nigerian communities, fairly compensated for their time and expertise, and provided with capacity-building training. A strict daily recording cap was enforced to protect speaker vocal well-being. Prior to participation, all contributors provided explicit, informed consent for their data to be collected, used, and hosted publicly on the Mozilla Data Collective for global research. Before release, the dataset underwent a rigorous "PII Scrub" to permanently remove all personally identifiable information, and a final transparency notice was issued to contributors to ensure compliance with the Nigerian Data Protection Act (NDPA 2023).

Intended Use

This dataset is intended for training and evaluating robust Automatic Speech Recognition (ASR) and Text-to-Speech (TTS) systems specifically designed for low-resource, code-switched languages. Furthermore, it serves as a benchmark for downstream NLP tasks such as word-level Language Identification (LID) and Speech Emotion Recognition (SER), and provides a rich resource for academic linguistic research analyzing high-density code-switching and complex tonal/prosodic interactions.
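For the word-level LID benchmark mentioned above, the simplest metric is per-token tag accuracy over aligned reference/hypothesis sequences. This is a minimal sketch with made-up tag sequences, not the official evaluation protocol.

```python
def lid_accuracy(ref_tags, hyp_tags):
    """Word-level language ID accuracy over pre-aligned tag sequences."""
    assert len(ref_tags) == len(hyp_tags), "sequences must be pre-aligned"
    correct = sum(r == h for r, h in zip(ref_tags, hyp_tags))
    return correct / len(ref_tags)

# Toy example: one Yoruba word mis-tagged as English.
ref = ["yor", "yor", "eng", "eng", "yor"]
hyp = ["yor", "eng", "eng", "eng", "yor"]
print(f"{lid_accuracy(ref, hyp):.2f}")  # 0.80
```

A real ASR+LID pipeline would first need word-level alignment between hypothesis and reference before tags can be compared.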

Metadata

DATASET OVERVIEW & DEMOGRAPHICS

The YECS Corpus provides ~120 hours of high-quality, naturally produced intra-sentential Yoruba-English code-switched speech.

  • Total Count: 99,930 validated audio-text pairs.

  • Data Splits: The dataset is partitioned with strict prompt-level disjointness into Training (~95.5 hours), Validation (~12 hours), and Testing (~11.8 hours). To prevent data contamination, the Test set is released as audio-only; ground-truth annotations are retained internally by LyngualLabs for benchmarking.

  • Item Metadata: Each utterance includes clean speech recordings (mean SNR of 91.6 dB), full Yoruba orthography, word-level language tags, emotion labels, and domain tags.

  • Speaker Demographics: Features 140 unique speakers (95 female / 67.9%, 45 male / 32.1%). No single speaker accounts for more than 5% of the total dataset duration.

  • Language Dominance Breakdown: English-led (49.4%), Yoruba-led (38.6%), Balanced (11.9%), Monolingual (0.01%).
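The prompt-level disjointness guarantee noted above is easy to verify mechanically. The sketch below assumes each split exposes a set of prompt IDs (the ID field name is hypothetical); it returns every ID that leaks across splits.

```python
def check_prompt_disjointness(split_prompts):
    """Return the set of prompt IDs that appear in more than one split."""
    seen = {}       # prompt ID -> first split it was seen in
    overlaps = set()
    for split, prompts in split_prompts.items():
        for pid in prompts:
            if pid in seen and seen[pid] != split:
                overlaps.add(pid)
            seen[pid] = split
    return overlaps

# Toy splits with hypothetical prompt IDs; an empty result means disjoint.
splits = {
    "train": {"p001", "p002"},
    "val": {"p003"},
    "test": {"p004"},
}
print(check_prompt_disjointness(splits))  # set()
```

Running the same check over the released train/validation manifests (the test set ships audio-only) would confirm no prompt is shared between them.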

METADATA DISTRIBUTIONS

To ensure robust benchmarking, the dataset is highly stratified across all splits.

1. Gender Distribution

Gender    Overall           Train     Val      Test
Female    94.20h (78.9%)    75.42h    9.47h    9.31h
Male      25.17h (21.1%)    20.15h    2.50h    2.52h

2. Emotion Category Distribution

Emotion      Overall            Train     Val       Test
Neutral      115.37h (96.7%)    92.35h    11.56h    11.46h
Happy        1.32h (1.1%)       1.09h     0.12h     0.11h
Sad          1.19h (1.0%)       0.94h     0.13h     0.12h
Angry        0.72h (0.6%)       0.59h     0.07h     0.07h
Disgusted    0.38h (0.3%)       0.30h     0.04h     0.04h
Surprised    0.21h (0.2%)       0.17h     0.03h     0.01h
Fearful      0.18h (0.2%)       0.15h     0.02h     0.02h

3. Semantic Domain Distribution

Domain            Overall           Train     Val      Test
General           19.56h (16.4%)    15.51h    2.02h    2.03h
Finance           13.66h (11.4%)    11.13h    1.24h    1.28h
Entertainment     11.09h (9.3%)     8.86h     1.19h    1.03h
News              8.65h (7.2%)      6.97h     0.89h    0.79h
History           8.48h (7.1%)      6.82h     0.84h    0.82h
Education         8.09h (6.8%)      6.36h     0.87h    0.85h
Fashion           7.42h (6.2%)      6.06h     0.67h    0.70h
Law               6.87h (5.8%)      5.42h     0.71h    0.74h
Religion          6.67h (5.6%)      5.44h     0.61h    0.61h
Sports            6.37h (5.3%)      4.99h     0.70h    0.67h
Transportation    6.02h (5.0%)      4.79h     0.63h    0.59h
Health            5.89h (4.9%)      4.70h     0.58h    0.62h
Science           5.76h (4.8%)      4.66h     0.55h    0.55h
Family            1.91h (1.6%)      1.51h     0.18h    0.22h
Agriculture       1.55h (1.3%)      1.24h     0.16h    0.15h
Tourism           1.40h (1.2%)      1.11h     0.12h    0.17h
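The stratification claim can be spot-checked by comparing a domain's share of each split. This sketch uses the Finance row from the table and the approximate split totals quoted earlier (~95.5h train, ~11.8h test); a well-stratified split keeps these shares close.

```python
def domain_share(domain_hours, split_total_hours):
    """Fraction of a split's total duration occupied by one domain."""
    return domain_hours / split_total_hours

# Finance: 11.13h of ~95.5h train, 1.28h of ~11.8h test (values from the card).
train_share = domain_share(11.13, 95.5)
test_share = domain_share(1.28, 11.8)
print(abs(train_share - test_share) < 0.02)  # True: shares are close
```

The same comparison can be run over every domain and emotion row to quantify how tightly the splits mirror the overall distribution.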

DATA COLLECTION & QUALITY PREPROCESSING

The dataset was curated using an ethical "data farming" framework designed to empower the local community. First, 50 trained bilingual writers generated 51,532 culturally grounded prompts. Next, linguistic experts validated these for tonal accuracy and applied language/emotion tags. Finally, 140 community speakers recorded the prompts via a device-agnostic web app.

Each prompt was recorded by three distinct speakers to ensure acoustic diversity. A mandatory record-submit-review workflow minimized clipping and background noise. Following strict quality assurance for intelligibility and tonal preservation, non-compliant samples were excluded from the initial raw collection, resulting in this final ~120-hour corpus.
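One automatable piece of the record-submit-review workflow described above is a clipping check on the submitted 16-bit PCM audio. This is a minimal sketch over raw sample values (the actual QA tooling and thresholds are not documented in the card and are assumed here).

```python
def clipping_ratio(samples, limit=32767):
    """Fraction of 16-bit PCM samples at or beyond full scale (clipped)."""
    clipped = sum(1 for s in samples if abs(s) >= limit)
    return clipped / len(samples)

# Toy signals: a quiet take versus one driven into the rails.
clean = [1000, -2000, 3000, -400]
hot = [32767, -32768, 1200, 32767]
print(clipping_ratio(clean))  # 0.0
print(clipping_ratio(hot))    # 0.75
```

A review step could reject any submission whose clipping ratio exceeds a small threshold (say 0.1%), which is one plausible way "minimized clipping" could be enforced at scale.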