Task: ASR
Release Date: 6/25/2026
Format: CSV, MP3
Size: 572.87 MB
This is an adapted and reorganized version of the Evenki data from AIRI's AI4TALK competition. It is intended for automatic speech recognition: each row identifies a speech segment in an MP3 file and provides an IPA-like transcription. Audio files are stored in `audio/`, and `asr.csv` references them with relative paths. The original AI4TALK language code and recommended MDC locale are both `evn`. Evenki is a Northern Tungusic language spoken in eastern Russia and China, with roughly 17,000 native speakers according to Wikipedia.
Licensing
Creative Commons Attribution Share Alike 4.0 International (CC-BY-SA-4.0)
https://spdx.org/licenses/CC-BY-SA-4.0.html
Restrictions/Special Constraints
Users must comply with the Creative Commons Attribution-ShareAlike 4.0 International license terms, including attribution and share-alike.
Forbidden Usage
Users must not use the data to reveal or infer any personal information about the speakers or about those who have contributed the source language data.
Ethical Review
The dataset was part of a competition that has already concluded, and it raised no ethical concerns at that time.
Intended Use
This dataset is intended for automatic speech recognition.
Technical summary: this package contains `asr.csv` with 3,879 segment-level rows and 1,387 MP3 audio files in `audio/`. The CSV columns are `id`, `start`, `end`, `source`, `lang`, and `transcription`. The license is Creative Commons Attribution-ShareAlike 4.0 International (CC BY-SA 4.0).
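The counts in the technical summary can be checked after download with a short script. The following is a minimal sketch, assuming the file is UTF-8 encoded and has a header row matching the column list above; the function name `summarize` is hypothetical and not part of the package.

```python
import csv
import io

# Column order as listed in the technical summary of this card.
EXPECTED_COLUMNS = ["id", "start", "end", "source", "lang", "transcription"]


def summarize(csv_file):
    """Return (row_count, distinct_audio_file_count) for an asr.csv-like file object."""
    reader = csv.DictReader(csv_file)
    if reader.fieldnames != EXPECTED_COLUMNS:
        raise ValueError(f"unexpected columns: {reader.fieldnames}")
    rows = 0
    sources = set()
    for row in reader:
        rows += 1
        sources.add(row["source"])  # relative path such as audio/7378.mp3
    return rows, len(sources)


# Demo on an in-memory copy of the header and the sample row from this card.
demo = io.StringIO(
    "id,start,end,source,lang,transcription\n"
    "1574,40.62300000000001,42.404,audio/7378.mp3,evn,laŋilwərtawutt͡ʃəŋkiːtin\n"
)
print(summarize(demo))  # prints: (1, 1)
```

On the real file, pass `open("asr.csv", newline="", encoding="utf-8")` instead of the in-memory demo. The row count should then be 3,879; the second number matches 1,387 only if every MP3 file is referenced by at least one row, which this card does not state explicitly.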
Source and provenance: this is an adapted and reorganized subset of AI4TALK by the Artificial Intelligence Research Institute (AIRI), original URL https://github.com/AIRI-Institute/AI4TALK. The Evenki material comes from the expedition and corpus materials of the Institute of Linguistics RAS / Minority Languages of Russia project; corpus access and project context are available at https://gisly.net/corpus/ and https://minlang.iling-ran.ru/corpora/evenki.
Transcription and conventions: AIRI describes the ASR target as IPA, the International Phonetic Alphabet; this is a phonetic transcription target and should not be treated as standard Evenki orthography. Counterparts of the provided IPA transcriptions in conventional Evenki orthography can be found in the Evenki tier of the corpus at https://gisly.net/corpus/.
Sample row:
id,start,end,source,lang,transcription
1574,40.62300000000001,42.404,audio/7378.mp3,evn,laŋilwərtawutt͡ʃəŋkiːtin
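The sample row can be turned into a typed segment record as follows. This is a minimal sketch under one assumption that the card does not state outright: `start` and `end` are offsets in seconds into the MP3 named by `source`. The `Segment` class and `parse_segment` function are hypothetical names introduced for illustration.

```python
import csv
import io
from dataclasses import dataclass


@dataclass
class Segment:
    id: str
    start_ms: int
    end_ms: int
    source: str
    lang: str
    transcription: str

    @property
    def duration_ms(self) -> int:
        return self.end_ms - self.start_ms


def parse_segment(row: dict) -> Segment:
    """Convert one CSV row (as a dict of strings) into a Segment."""
    # Assumption: start/end are seconds. Rounding to whole milliseconds
    # removes float noise such as 40.62300000000001.
    start_ms = round(float(row["start"]) * 1000)
    end_ms = round(float(row["end"]) * 1000)
    if end_ms <= start_ms:
        raise ValueError(f"segment {row['id']} has non-positive duration")
    return Segment(
        id=row["id"],
        start_ms=start_ms,
        end_ms=end_ms,
        source=row["source"],
        lang=row["lang"],
        transcription=row["transcription"],
    )


# Demo on the header and sample row shown above.
sample = io.StringIO(
    "id,start,end,source,lang,transcription\n"
    "1574,40.62300000000001,42.404,audio/7378.mp3,evn,laŋilwərtawutt͡ʃəŋkiːtin\n"
)
seg = parse_segment(next(csv.DictReader(sample)))
print(seg.source, seg.start_ms, seg.end_ms, seg.duration_ms)
# prints: audio/7378.mp3 40623 42404 1781
```

Integer millisecond offsets are a convenient form for slicing audio, since many audio tools index by milliseconds or by sample count rather than by floating-point seconds.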