AIRI AI4TALK Evenki ASR

Description

This is an adapted and reorganized Evenki dataset from AIRI's AI4TALK competition. It is intended for automatic speech recognition: each row identifies a speech segment in an MP3 file and provides an IPA-like transcription. Audio files are stored in `audio/`, and `asr.csv` references them with relative paths. The original AI4TALK language code and the recommended MDC locale are both `evn`. Evenki is a Northern Tungusic language spoken in eastern Russia and China, with roughly 17,000 native speakers according to Wikipedia's current infobox.

Specifics

Licensing

Creative Commons Attribution Share Alike 4.0 International (CC-BY-SA-4.0)

https://spdx.org/licenses/CC-BY-SA-4.0.html

Considerations

Restrictions/Special Constraints

Users must comply with the Creative Commons Attribution-ShareAlike 4.0 International license terms, including attribution and share-alike.

Technical summary: this package contains `asr.csv` with 3,879 segment-level rows and 1,387 MP3 audio files in `audio/`. The CSV columns are `id`, `start`, `end`, `source`, `lang`, and `transcription`. The license is Creative Commons Attribution-ShareAlike 4.0 International (CC BY-SA 4.0).
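The summary above can be checked against a local copy with a minimal sketch like the following. The function name `summarize_manifest` is hypothetical and not part of the package; the sketch assumes the CSV is UTF-8 encoded and that the header matches the column list quoted above exactly.

```python
import csv
import io

# Column order as listed in the technical summary.
EXPECTED_COLUMNS = ["id", "start", "end", "source", "lang", "transcription"]


def summarize_manifest(handle):
    """Return (row_count, distinct_audio_file_count) for an asr.csv-style file.

    `handle` is any text file object. Raises ValueError if the header
    does not match EXPECTED_COLUMNS.
    """
    reader = csv.DictReader(handle)
    if reader.fieldnames != EXPECTED_COLUMNS:
        raise ValueError(f"unexpected columns: {reader.fieldnames}")
    rows = 0
    sources = set()
    for row in reader:
        rows += 1
        sources.add(row["source"])  # relative path such as audio/7378.mp3
    return rows, len(sources)
```

For the real file, something like `with open("asr.csv", newline="", encoding="utf-8") as f: print(summarize_manifest(f))` should report 3,879 rows. The second number matches 1,387 only if every MP3 in `audio/` is referenced by at least one row, which is an assumption rather than something stated here.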

Source and provenance: this is an adapted and reorganized subset of AI4TALK by the Artificial Intelligence Research Institute (AIRI); the original repository is https://github.com/AIRI-Institute/AI4TALK. The Evenki material comes from the expedition and corpus materials of the Institute of Linguistics RAS / Minority Languages of Russia project; corpus access and project context are available at https://gisly.net/corpus/ and https://minlang.iling-ran.ru/corpora/evenki.

Transcription and conventions: AIRI describes the ASR target as IPA (the International Phonetic Alphabet). The transcriptions are therefore phonetic and should not be treated as standard Evenki orthography. Conventional orthographic counterparts of the IPA transcriptions can be found in the Evenki tier of the corpus at https://gisly.net/corpus/.

Sample row:

id,start,end,source,lang,transcription
1574,40.62300000000001,42.404,audio/7378.mp3,evn,laŋilwərtawutt͡ʃəŋkiːtin
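As an illustration of how a row maps to an audio segment, the sketch below parses the sample row and converts its boundaries to integer milliseconds. It assumes `start` and `end` are offsets in seconds from the beginning of the MP3 file, which this description does not state explicitly; `segment_bounds_ms` is a hypothetical helper name.

```python
import csv
import io

# Header plus the sample row shown above.
SAMPLE = (
    "id,start,end,source,lang,transcription\n"
    "1574,40.62300000000001,42.404,audio/7378.mp3,evn,laŋilwərtawutt͡ʃəŋkiːtin\n"
)


def segment_bounds_ms(row):
    """Convert a row's start/end (assumed to be seconds) to whole milliseconds."""
    start_ms = round(float(row["start"]) * 1000)
    end_ms = round(float(row["end"]) * 1000)
    if end_ms <= start_ms:
        raise ValueError(f"empty or inverted segment in row {row['id']}")
    return start_ms, end_ms


row = next(csv.DictReader(io.StringIO(SAMPLE)))
start_ms, end_ms = segment_bounds_ms(row)
print(row["source"], start_ms, end_ms, end_ms - start_ms)
# prints: audio/7378.mp3 40623 42404 1781
```

Rounding to milliseconds also absorbs floating-point noise such as the trailing digits in `40.62300000000001`. The resulting offsets can be passed to whichever audio library is used to cut the segment out of the source file.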
