Tequila Zongolica Nahuatl ASR Dataset

Description

An ASR dataset of Tequila-Zongolica (aka Orizaba, Central Veracruz) Nahuatl, ISO 639-3 nlv. This is a derivative work of the Tequila-Zongolica Nahuatl Audio and Transcriptions datasets. It consists of the subset of the larger audio dataset that has transcriptions (approximately 16 hours), converted to the Mozilla Common Voice Scripted Speech format. The original stereo audio has been split and aligned with the parsed transcriptions.

Specifics

Licensing

Creative Commons Attribution No Derivatives 4.0 International (CC-BY-ND-4.0)

https://spdx.org/licenses/CC-BY-ND-4.0.html

Considerations

Restrictions/Special Constraints

This derivative dataset is distributed with the permission of the original authors. It maintains the same license as the source material.

Tequila-Zongolica Nahuatl ASR Corpus

This is a derivative work of the Tequila Zongolica Nahuatl Audio dataset and the Tequila Zongolica Nahuatl Transcriptions dataset by Amith et al. (2026), optimized for ASR training and evaluation.

This corpus contains approximately 16 hours of speech and transcriptions from Nahuatl speakers in the municipalities of Tequila and Zongolica, state of Veracruz, Mexico. The specific Nahuatl variety is often referred to as "Orizaba" or "Central Veracruz" Nahuatl (Spanish: "Náhuatl central de Veracruz"). Its ISO 639-3 code is "nlv".

Processing

The original full-length audio files that had corresponding transcriptions were segmented based on the transcription timestamps, with each channel corresponding to the appropriate speaker (in cases where there are two speakers). The segmented audio was output in .flac format. The original transcriptions are available, as well as an optional "normalized" version, which removes vowel-length marking and metalinguistic information (such as asterisks indicating that a word is a Spanish loan). Data splits were selected to ensure no speaker overlap.
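The timestamp-based segmentation can be illustrated with a minimal sketch. This is not the script used to build the corpus: the function name `segment_channel` is hypothetical, timestamps are assumed to be in seconds, and the mapping from speaker to channel index is assumed to be known in advance.

```python
def segment_channel(frames, sample_rate, start, stop, channel):
    """Cut the span [start, stop) from one channel of multi-channel audio.

    frames      -- sequence of per-sample tuples, e.g. (left, right)
    sample_rate -- samples per second
    start, stop -- timestamps, assumed here to be in seconds
    channel     -- index of the channel assigned to the speaker (assumption)
    """
    if stop < start:
        raise ValueError("stop must not precede start")
    first = int(round(start * sample_rate))
    last = int(round(stop * sample_rate))
    # Keep only the requested channel from each sample in the span.
    return [frame[channel] for frame in frames[first:last]]


# Toy stereo signal: ten samples at 10 Hz, right channel is the negated left.
stereo = [(i, -i) for i in range(10)]
right_segment = segment_channel(stereo, 10, 0.2, 0.5, channel=1)
```

In practice the samples would come from an audio library and the result would be written out as a .flac clip; both steps are omitted here.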

Format

The dataset has been formatted to match the Mozilla Common Voice Scripted Speech datasets. There are three tsv files corresponding to the randomly generated data splits: "train.tsv", "dev.tsv", and "test.tsv". Each utterance has a corresponding audio file, and all audio files are in the clips/ directory. Each tsv file has the following columns:

Column Name	Description
audio	The name of the specific audio segment file.
original_audio	The corresponding full-length audio file from the original dataset.
original_transcription	The corresponding `.trs` file from the original transcription dataset.
speaker	The unique identifier (ID) for the speaker.
start	The starting timestamp within the original audio file.
stop	The ending timestamp within the original audio file.
transcription	The raw text of what was spoken.
normalized	A normalized form of the transcription, normalizing disfluencies and
split	The dataset partition (e.g., train, dev, or test).

The train split has 24 speakers, the dev split has 6 speakers, and the test split has 5 speakers. Speaker information can be consulted in the original Tequila Zongolica Nahuatl Audio dataset.
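As a rough illustration of working with this layout, the sketch below reads the split files with Python's standard `csv` module, resolves clip paths, and checks that no speaker appears in more than one split. It assumes each TSV file begins with a header row containing the column names listed above; the helper names and the dataset directory are hypothetical.

```python
import csv
from pathlib import Path

SPLITS = ("train", "dev", "test")


def load_split(dataset_dir, split):
    """Read one split's TSV into a list of dicts keyed by column name."""
    path = Path(dataset_dir) / f"{split}.tsv"
    with open(path, newline="", encoding="utf-8") as f:
        # QUOTE_NONE keeps quotation marks in transcriptions as literal text.
        return list(csv.DictReader(f, delimiter="\t", quoting=csv.QUOTE_NONE))


def clip_path(dataset_dir, row):
    """Resolve the audio file for a row inside the clips/ directory."""
    return Path(dataset_dir) / "clips" / row["audio"]


def speakers_by_split(dataset_dir):
    """Map each split name to the set of speaker IDs it contains."""
    return {s: {row["speaker"] for row in load_split(dataset_dir, s)}
            for s in SPLITS}


def overlapping_speakers(speakers):
    """Return speaker IDs that occur in more than one split."""
    overlap = set()
    names = list(speakers)
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            overlap |= speakers[a] & speakers[b]
    return overlap
```

Given the speaker-disjoint splits described above, `overlapping_speakers(speakers_by_split(dataset_dir))` would be expected to return an empty set.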

Citation / Attribution

If you use this dataset (Pugh 2026), please cite both sources: the original work and this licensed derivative.

Pugh, Robert. 2026. Tequila Zongolica Nahuatl ASR-Ready Corpus. Derived from Amith, Panzo, Citlahua, Domínguez, Salgado, and Salgado (2026).

Amith, Jonathan D., Bernarda Panzo Tezoco, Gabriela Citlahua Zepahua, Amelia Domínguez Alcántara, and Ceferino Salgado Castañeda. 2026. Corpus of spoken Nahuatl from the municipalities of Atlahuilco, Rafael Delgado, Tequila, and Zongolica, state of Veracruz, with transcriptions, translations, and annotations. Downloaded from Mozilla Data Collective on yyyy-mm-dd.

This dataset is a licensed derivative work (Amith 2026-02-12). To ensure proper credit is given to the original linguists and community members who recorded and transcribed this data, all publications using this version must cite both the primary source and this ASR-ready derivative work (see above). The foundational scholarship, field recordings, and transcriptions were produced by Amith, Panzo, Citlahua, Domínguez, Salgado, and Salgado (2026).

Example In-Text Citation

"We trained our models using the ASR-optimized version of the Tequila Zongolica Nahuatl corpus (Amith et al. 2026; Pugh 2026)."

License Note

This derivative dataset is distributed with the permission of the original authors. It maintains the same license terms as the source material.