Task: ASR
Release Date: 6/29/2026
Format: FLAC, TSV
Size: 875.17 MB
An ASR dataset of Tequila-Zongolica (aka Orizaba, Central Veracruz) Nahuatl, ISO 639-3 nlv. This is a derivative work of the Tequila-Zongolica Nahuatl Audio and Transcriptions datasets. It consists of the subset of the larger audio dataset that has transcriptions (approximately 16 hours), converted to the Mozilla Common Voice Scripted Speech format. The original stereo audio has been split and aligned with the parsed transcriptions.
Licensing
Creative Commons Attribution No Derivatives 4.0 International (CC-BY-ND-4.0)
https://spdx.org/licenses/CC-BY-ND-4.0.html
Restrictions/Special Constraints
This derivative dataset is distributed with the permission of the original authors. It maintains the same license as the source material.
Forbidden Usage
NA
Intended Use
This dataset is specifically formatted to facilitate ASR model training and evaluation.
This is a derivative work of the Tequila Zongolica Nahuatl Audio dataset and the Tequila Zongolica Nahuatl Transcriptions dataset (Amith et al. 2026), optimized for ASR training and evaluation.
This corpus contains 16 hours of speech and transcriptions of Nahuatl speakers from the municipalities of Tequila and Zongolica, state of Veracruz, Mexico. The specific Nahuatl variety is often referred to as "Orizaba" or "Central Veracruz" Nahuatl (Spanish: "Náhuatl central de Veracruz"). Its ISO 639-3 code is "nlv".
The original, full-length audio files that had corresponding transcriptions were segmented based on the transcription timestamps, with each channel corresponding to the appropriate speaker (in cases where there are two speakers). The segmented audio was output in FLAC format. The original transcriptions are available, as well as an optional "normalized" version, which removes vowel-length marking and metalinguistic information (such as asterisks indicating that a word is a Spanish loan). Data splits were selected to ensure no speaker overlap.
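The segmentation step described above can be illustrated with a minimal sketch. This is not the pipeline used to build the dataset. It assumes that timestamps are in seconds and that the stereo audio has already been decoded into a sequence of (left, right) sample pairs. The function name `cut_segment` and the channel numbering (0 for left, 1 for right) are hypothetical.

```python
def cut_segment(frames, sample_rate, start, stop, channel):
    """Return the mono samples of one channel between start and stop.

    frames:      sequence of (left, right) sample pairs (assumed layout)
    sample_rate: samples per second
    start, stop: segment boundaries, assumed to be in seconds
    channel:     0 for left, 1 for right (hypothetical speaker mapping)
    """
    if stop < start:
        raise ValueError("stop must not precede start")
    # Convert timestamps to sample indices and clamp them to the recording.
    first = max(0, round(start * sample_rate))
    last = min(len(frames), round(stop * sample_rate))
    # Keep only the channel assigned to the speaker of this utterance.
    return [frame[channel] for frame in frames[first:last]]
```

In practice an audio library would handle decoding and FLAC output. The sketch only shows how a timestamp pair and a channel choice select one speaker's utterance from a two-speaker recording.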
The dataset has been formatted to match the Mozilla Common Voice Scripted Speech datasets. There are three TSV files corresponding to the randomly generated data splits: "train.tsv", "dev.tsv", and "test.tsv". Each utterance has a corresponding audio file, and all audio files are in the clips/ directory. Each TSV file has the following columns:
| Column Name | Description |
|---|---|
| audio | The name of the specific audio segment file. |
| original_audio | The corresponding full-length audio file from the original dataset. |
| original_transcription | The corresponding .trs file from the original transcription dataset. |
| speaker | The unique identifier (ID) for the speaker. |
| start | The starting timestamp within the original audio file. |
| stop | The ending timestamp within the original audio file. |
| transcription | The raw text of what was spoken. |
| normalized | A normalized form of the transcription, normalizing disfluencies and |
| split | The dataset partition (e.g., train, dev, or test). |
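As an illustration of the layout above, the following sketch reads one split file with Python's standard `csv` module. It assumes that each TSV file starts with a header row using the column names from the table and that the `audio` column holds a file name relative to the clips/ directory. The function names and the added `clip_path` key are hypothetical conveniences, not part of the dataset.

```python
import csv
from pathlib import Path

# Column names as listed in the table above.
EXPECTED_COLUMNS = [
    "audio", "original_audio", "original_transcription", "speaker",
    "start", "stop", "transcription", "normalized", "split",
]


def parse_split(lines, clips_dir="clips"):
    """Parse tab-separated lines (header first) into a list of row dicts."""
    # QUOTE_NONE keeps quotation marks inside transcriptions literal.
    reader = csv.DictReader(lines, delimiter="\t", quoting=csv.QUOTE_NONE)
    missing = set(EXPECTED_COLUMNS) - set(reader.fieldnames or [])
    if missing:
        raise ValueError(f"missing columns: {sorted(missing)}")
    rows = []
    for row in reader:
        # Hypothetical extra key: where the clip is assumed to live on disk.
        row["clip_path"] = str(Path(clips_dir) / row["audio"])
        rows.append(row)
    return rows


def read_split(tsv_path, clips_dir="clips"):
    """Open one split file, e.g. "train.tsv", and parse it."""
    with open(tsv_path, newline="", encoding="utf-8") as handle:
        return parse_split(handle, clips_dir)
```

Calling `read_split("train.tsv")` from the dataset root would return one dictionary per utterance, with the `transcription` and `normalized` fields available as training targets.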
The train split has 24 speakers, the dev split has 6 speakers, and the test split has 5 speakers. Speaker information can be consulted in the original Tequila Zongolica Nahuatl Audio dataset.
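Users who want to confirm the speaker separation between splits, or estimate how much audio each split contains, could use a sketch like the one below. It operates on row dictionaries keyed by the column names above. The helper names are hypothetical, and `total_hours` assumes that `start` and `stop` are expressed in seconds.

```python
from collections import defaultdict
from itertools import combinations


def speakers_by_split(rows):
    """Map each split name to the set of speaker IDs that appear in it."""
    groups = defaultdict(set)
    for row in rows:
        groups[row["split"]].add(row["speaker"])
    return dict(groups)


def overlapping_speakers(groups):
    """Return {(split_a, split_b): shared_ids} for every pair that overlaps."""
    overlaps = {}
    for a, b in combinations(sorted(groups), 2):
        shared = groups[a] & groups[b]
        if shared:
            overlaps[(a, b)] = shared
    return overlaps


def total_hours(rows):
    """Sum segment durations, assuming start and stop are in seconds."""
    seconds = sum(float(r["stop"]) - float(r["start"]) for r in rows)
    return seconds / 3600.0
```

An empty result from `overlapping_speakers` is what the description above leads one to expect. The per-split speaker counts can be read from the sizes of the sets returned by `speakers_by_split`.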
Please cite both sources, original and licensed derivative, if using Pugh 2026.
Pugh, Robert. 2026. Tequila Zongolica Nahuatl ASR-Ready Corpus. Derived from Amith, Panzo, Citlahua, Domínguez, Salgado, and Salgado (2026).
Amith, Jonathan D., Bernarda Panzo Tezoco, Gabriela Citlahua Zepahua, Amelia Domínguez Alcántara, and Ceferino Salgado Castañeda. 2026. Corpus of spoken Nahuatl from the municipalities of Atlahuilco, Rafael Delgado, Tequila, and Zongolica, state of Veracruz, with transcriptions, translations, and annotations. Downloaded from Mozilla Data Collective on yyyy-mm-dd.
This dataset is a licensed derivative work (Amith 2026-02-12). To ensure proper credit is given to the original linguists and community members who recorded and transcribed this data, all publications using this version must cite both the primary source and this ASR-ready derivative work (see above). The foundational scholarship, field recordings, and transcriptions were produced by Amith, Panzo, Citlahua, Domínguez, Salgado, and Salgado (2026).
"We trained our models using the ASR-optimized version of the Tequila Zongolica Nahuatl corpus (Amith et al. 2026; Pugh 2026)."