Task: ASR
Release Date: 5/27/2026
Format: WAV
Size: 52.63 GB
Audio corpus of the Orizaba-Zongolica Nahuatl language (Glottocode: oriz1235) with a total duration of approximately **122:02:38** (hours:mins:secs).
Licensing
Creative Commons Attribution No Derivatives 4.0 International (CC-BY-ND-4.0)
https://spdx.org/licenses/CC-BY-ND-4.0.html
Restrictions/Special Constraints
Derivative works, though encouraged, are not permitted without express consent of the dataset owner, Jonathan Amith.
Forbidden Usage
N/A
Intended Use
This dataset is intended as an archive of linguistic documentation materials and as a data source for speech and language technologies for Nahuatl.
This corpus was created with financial support from a National Science Foundation Dynamic Language Infrastructure collaborative grant to Jonathan D. Amith (PI), with Gettysburg College as the lead institution. The grant is #2123578, Collaborative Research: Improving Techniques of Automatic Speech Recognition and Transfer Learning using Documentary Linguistic Corpora. The other half of the collaborative grant (#2123624) had Shinji Watanabe as PI at Carnegie Mellon University.
Amith, Jonathan D., Bernarda Panzo Tezoco, Gabriela Citlahua Zepahua, Amelia Domínguez Alcántara, and Ceferino Salgado Castañeda. 2026. Corpus of spoken Nahuatl from the municipalities of Atlahuilco, Rafael Delgado, Tequila, and Zongolica, state of Veracruz, with transcriptions, translations, and annotations. Downloaded from Mozilla Data Collective on yyyy-mm-dd.
Although the archived version of this corpus is CC-BY-ND, the purpose of this license is simply to ensure that any use and derivatives of this corpus adhere to ethical standards for the use of Indigenous language material, including the recognition of authors, the native speakers who have generously shared their language and culture with the team that recorded, transcribed, translated, and otherwise annotated these audio files. Native speakers are experts and teachers who have agreed to share their knowledge with those who made the recordings, with their community, their schools, and the general public. This is formally acknowledged. A good set of best practices for treatment of this knowledge is found at Guidelines for Respecting Cultural Knowledge, which is a detailed document of protocols to follow that respect Indigenous linguistic and cultural knowledge.
Note specifically the requirement that the researcher should "Identify all primary contributors and secondary sources for a particular document, and share the authorship whenever possible." Note also the request that all custodians of local knowledge be "identified" and be considered "co-authors". These are the protocols that we follow for joint work with native experts. Anonymity is of course possible if the speaker so requests it. But in 25 years of our work in Indigenous communities, no speaker has ever requested anonymity and, indeed, once the goals of the project are explained, all have enthusiastically accepted their public role of custodians and teachers.
Considering the above, all efforts will be made to share the corpus with others who might want to create derivatives to enhance educational and research goals. Requests will be decided promptly on a case-by-case basis to ensure compliance with ethical practices. Please contact Jonathan D. Amith at nahuatl.biology@gmail.com.
This corpus comprises 665 audio files of the Orizaba-Zongolica Nahuatl language (Glottocode: oriz1235) with a total duration of approximately 122:02:38 (hours:mins:secs). The recordings were made in the municipalities of Atlahuilco, Rafael Delgado, Tehuipango, Tequila, and Zongolica; the number of audio files from each municipality is listed below:
Municipality of Atlahuilco — 35 recordings
Municipality of Rafael Delgado — 18 recordings
Municipality of Tehuipango — 1 recording
Municipality of Tequila — 533 recordings
Municipality of Zongolica — 79 recordings
Within each municipality, recordings were made in the following communities and barrios (numbers in parentheses give the number of audio files from each community):
Atlahuilco: Xibtla (all 35 recordings)
Rafael Delgado: Barrio Primero (8); Quinto Barrio (9); Centro (1). Note that one recording includes a second speaker, Florinda Calihua Zoquitecatl (FCZ560), a young woman from the municipality of Tehuipango.
Tequila: Barrio Tecuanca (77); La Cumbre (38); Número Diez (12); Santa Cruz (86); Teotzacualco (159); Tepapalotla (5); Tequila centro (156)
Zongolica: Ixpaluca (all 79 recordings)
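As a quick consistency check, the per-community counts listed above can be tallied and compared with the municipality totals and the stated 665 audio files. The sketch below is illustrative only: the dictionary is copied by hand from this listing (it is not read from the dataset's metadata), the variable names are hypothetical, and the Tehuipango entry is left out because no community is listed for it.

```python
# Recording counts per community, copied by hand from the listing above.
# Tehuipango is omitted here: no community is listed for it.
counts = {
    "Atlahuilco": {"Xibtla": 35},
    "Rafael Delgado": {"Barrio Primero": 8, "Quinto Barrio": 9, "Centro": 1},
    "Tequila": {
        "Barrio Tecuanca": 77,
        "La Cumbre": 38,
        "Número Diez": 12,
        "Santa Cruz": 86,
        "Teotzacualco": 159,
        "Tepapalotla": 5,
        "Tequila centro": 156,
    },
    "Zongolica": {"Ixpaluca": 79},
}

# Sum the community counts within each municipality, then overall.
per_municipality = {name: sum(c.values()) for name, c in counts.items()}
total = sum(per_municipality.values())

for name, n in per_municipality.items():
    print(f"{name}: {n}")
print(f"Total: {total}")
```

With these figures the four municipalities sum to 35 + 18 + 533 + 79 = 665, which matches the stated number of audio files.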
A total of 37 speakers contributed to the corpus:
Florinda Calihua Zoquitecatl; María Aurelia Chicahua Zitlahua; Gabriela Citlahua Zepahua; Catalina Cocotle Cocotle; Ángela Cocotle Tezoco; Concepción Isabel Colohua Coquehua; Juana Hernández Carrasco; Rogelio Hernández Rojas; Angélica Hernández Vázquez; Ángela Ixmatlahua Xocua; Eugenio Marcelino Rosa; Lourdes Marcelino Santiago; Magdalena Mazahua Cuicahua; Clemente Mazahua Ixmatlahua; Martha Merino Luisa; Cecilia Miquixtle González; María del Carmen Montalvo Itehua; Teresa Otlehua Apale; Eduardo Otlehua; José Félix Panzo Maldonado; Bernarda Panzo Tezoco; María de la Luz Santiago Próspero; Reina Santiago Próspero; Matea Tezoco Sánchez; Juana Tlaxca Ixmatlahua; Rosario Tlaxcala Xicalhua; Rosa Tzitzihua Coyohua; María Cristina Tzitzihua Cuanecuilco; Francisca Tzontehua Tlaxcala; Guadalupe Tzopitl Montalvo; José Eleno Tzopitl; Reina Xicalhua Tlaxcala; Magdaleno Xocua Tehuintle; José Domingo Xocua Texcahua; María Rosaria Xocua Tlecuile; Antonia Xotlanihua Cozcahua; Ernesto Zepahua Colohua.
Included in the list are two native speakers, Gabriela Citlahua Zepahua and Bernarda Panzo Tezoco, who led local research in Tequila and Zongolica, respectively, recording and interviewing as evidenced in the metadata for the audio recordings. Bernarda Panzo has continued with the project and been instrumental in our understanding of the language and in the transcriptions of 56 audio recordings included in this first deposit.
In addition, two native speakers, Amelia Domínguez Alcántara and Ceferino Salgado Castañeda, both from the municipality of Cuetzalan del Progreso (classified within a distinct Nahuatl language: Glottolog high1234), recorded most of the corpus along with Jonathan D. Amith.
A total of 13 hours 6 minutes of transcriptions were produced from 76 audio files with this first deposit in January 2026. As work progresses, the transcriptions may be edited, and more audio will be transcribed. Plans are also in place for adding free translations and other annotations. New versions of the transcriptions, translations, and annotations will be uploaded as they are created.
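For orientation, the transcribed share of the corpus can be worked out from the two durations given in this description (122:02:38 of audio in total, 13 hours 6 minutes transcribed from 76 files). This is a minimal sketch; the helper name `to_seconds` is hypothetical and the figures are simply those quoted in the text.

```python
def to_seconds(hours: int, minutes: int = 0, seconds: int = 0) -> int:
    """Convert an hours/minutes/seconds duration to whole seconds."""
    return hours * 3600 + minutes * 60 + seconds


total_s = to_seconds(122, 2, 38)   # total corpus duration quoted in the text
transcribed_s = to_seconds(13, 6)  # transcribed duration quoted in the text
n_transcribed_files = 76           # number of transcribed files quoted in the text

# Fraction of the audio that has a transcription so far.
share = transcribed_s / total_s
# Average length of a transcribed file, in minutes.
mean_minutes = transcribed_s / n_transcribed_files / 60

print(f"Transcribed share: {share:.1%}")
print(f"Mean length of a transcribed file: {mean_minutes:.1f} min")
```

On these figures, roughly 10.7% of the audio has been transcribed, and the transcribed files average about 10.3 minutes each.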
The original plan to document the Nahuatl of central Veracruz (municipality of Tequila) was carried out with the key help of Gabriela Citlahua Zepahua, who was recommended to Amith by Magnus Pharao Hansen, with whom she had previously worked. Amith visited the area in February and early March 2022, accompanied by Amelia Domínguez Alcántara and Ceferino Salgado Castañeda. Domínguez and Salgado continued to record through March 2022, until 1 April of that year. They returned to recording in September and October 2022 and then again in March 2023. An additional 81 field botany recordings were made by Miriam Jiménez Chimil and Mariano Gorostiza Salazar from 7 to 16 March 2023.
Amith, Domínguez, and Salgado recorded with a Sound Devices 702 recorder and Countryman E6 ear-worn omnidirectional microphones. All 76 transcriptions covering 13 hours 6 minutes were done by the research team: B. Panzo, J. Amith, A. Domínguez, and C. Salgado. Also integral to the transcription of Orizaba-Zongolica Nahuatl was Ángeles Márquez Hernández, from San Miguel Tenango, municipality of Zacatlán de las Manzanas, Puebla.
At the time of archiving (May 2026), a total of 76 files with a duration of 13 hours 6 minutes had been transcribed. The transcriptions are released as a separate MDC dataset. The transcriptions were done by Jonathan Amith, Bernarda Panzo, and Ángeles Márquez. Subsequent segmentation of all unique word forms is presently being carried out by Amith, Panzo, Márquez, Domínguez, Salgado, and Jeremías Cabrera Ortiz, a native Nahuatl speaker from San Agustín Oapan, Guerrero, who recently joined the research team. The transcriptions have been carefully reviewed and will be improved once the team begins to segment all unique word forms, a process that brings out inconsistencies and errors. When complete, this set of reviewed transcriptions will be placed in this deposit as a new version.
The next step is to use the 13+ hours as a training corpus for automatic speech recognition, and to transfer the ASR recipe in ESPnet that was built from a larger transcribed corpus of Nahuatl from the municipality of Cuetzalan del Progreso. The ASR output will be corrected by Bernarda Panzo and others on the research team until all 114.25 hours have been corrected by human effort.
Once the previous step has been completed, the following years will be dedicated to free translation and, hopefully, morphological segmentation and glossing.