License: CC-BY-SA-4.0
Steward: MDC Community Concierge
Task: ASR
Release Date: 4/17/2026
Format: MP3, TSV
Size: 201.79 MB
A corpus of read speech by learners of English living in Mexico. The current version includes 8 speakers and nearly 8 hours of recorded speech.
Licensing
Creative Commons Attribution Share Alike 4.0 International (CC-BY-SA-4.0)
https://spdx.org/licenses/CC-BY-SA-4.0.html
Restrictions/Special Constraints
N/A
Forbidden Usage
- You agree not to attempt to determine the identity of speakers in this dataset
- Any attempt to clone the voice or train models that imitate the speakers in this dataset is forbidden
Intended Use
- ASR for English learners
- Language learning applications
- Linguistics research into second language learning
This dataset contains read speech recordings from 8 English language learners from Mexico.
The dataset contains a tsv file, metadata.tsv, with the following columns:
audio_id: a key of the form speaker_id-audio_id
speaker_id
audio_filename
sentence: text
num attempts: Speakers were asked to read the sentence as fluidly as possible, and encouraged to do retakes if they struggled during a reading. This column shows how many attempts were taken to record the sentence.
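As a rough illustration of how the columns above might be used, the following Python sketch reads tab-separated metadata and summarizes it per speaker. It is a minimal sketch under stated assumptions, not an official loader: the function names (load_metadata, utterances_per_speaker, retakes) are hypothetical, the exact header spelling of the attempts column ("num attempts") is taken from the description above and may differ in the actual file, and the sample rows are invented for demonstration.

```python
import csv
import io
from collections import Counter


def load_metadata(source):
    """Parse tab-separated text from an open file (or any iterable of lines)."""
    # QUOTE_NONE keeps quotation marks inside sentences as literal characters.
    reader = csv.DictReader(source, delimiter="\t", quoting=csv.QUOTE_NONE)
    return list(reader)


def utterances_per_speaker(rows):
    """Count how many recordings each speaker contributed."""
    return Counter(row["speaker_id"] for row in rows)


def retakes(rows, attempts_col="num attempts"):
    """Return the rows that needed more than one attempt.

    attempts_col is an assumption: check the real header in metadata.tsv.
    """
    return [row for row in rows if int(row[attempts_col]) > 1]


# Made-up sample rows for illustration only; not taken from the dataset.
SAMPLE = (
    "audio_id\tspeaker_id\taudio_filename\tsentence\tnum attempts\n"
    "spk1-0001\tspk1\tspk1-0001.mp3\tWhere are you going?\t1\n"
    "spk1-0002\tspk1\tspk1-0002.mp3\tI don't know.\t3\n"
    "spk2-0001\tspk2\tspk2-0001.mp3\tWhere are you going?\t1\n"
)

rows = load_metadata(io.StringIO(SAMPLE))
print(utterances_per_speaker(rows))
print([row["audio_id"] for row in retakes(rows)])
```

To read the real file instead of the sample, one option is to open it with open("metadata.tsv", newline="", encoding="utf-8") and pass the file handle to load_metadata; the UTF-8 encoding is assumed here rather than documented.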
The source text consists of 1,000 sentences taken from this multilingual readability corpus. The sentences are from OpenSubtitles, and are between 1 and 10 words long.
The 8 speakers are all L2 English learners living in Mexico. All but one speak only Spanish natively (the remaining speaker is a native bilingual of Nahuatl and Spanish). Their ages range from 18 to 40.
Some of the speakers answered a short survey about their language experience. The answers are stored as text files in the speaker_questionnaires directory.