AIRI AI4TALK Meadow Mari ASR

Description

This is an adapted and reorganized Meadow Mari dataset derived from AIRI's AI4TALK competition data. It is intended for automatic speech recognition: each row identifies a speech segment in an MP3 file and provides a transcription in Meadow Mari Cyrillic orthography, taken from the HSE/LingConLab Spoken Meadow Mari corpus. The total audio length is 1.29 h. Audio files are stored in `audio/`, and `asr.csv` references them with relative paths. The corpus reflects natural speech and includes frequent Russian code-switching (written in standard Russian Cyrillic). The original AI4TALK language code and the recommended MDC locale are both `mhr`. Meadow Mari, also known as Meadow-Eastern Mari, is a Uralic Mari language spoken mostly in European Russia, including Mari El, with roughly 470,000 native speakers according to Wikipedia's current infobox.

Specifics

Licensing

Creative Commons Attribution Share Alike 4.0 International (CC-BY-SA-4.0)

https://spdx.org/licenses/CC-BY-SA-4.0.html

Considerations

Technical summary: this package contains `asr.csv` with 2,428 segment-level rows and the 533 MP3 audio files they reference in `audio/`. The CSV columns are `id`, `start`, `end`, `source`, `lang`, and `transcription`. The license is Creative Commons Attribution-ShareAlike 4.0 International (CC BY-SA 4.0). The earlier automated silver columns (`orthography_silver`, `translit_status`) have been removed, superseded by the corpus's own Cyrillic orthography. 93 rows from the AI4TALK export were dropped (77 untimed single-word clips, 2 with no matching corpus interval, 14 whose corpus text was pure annotation).

Source and provenance: this is an adapted and reorganized subset of AI4TALK by the Artificial Intelligence Research Institute (AIRI), original URL https://github.com/AIRI-Institute/AI4TALK. The Meadow Mari material comes from the HSE/LingConLab Spoken Meadow Mari corpus; source data and corpus context are available at https://github.com/LingConLab/data_oral_meadow-mari_corpus and http://lingconlab.ru/spoken_meadow_mari/.

Transcription and conventions: the transcription column is in Meadow Mari Cyrillic orthography as transcribed in the HSE/LingConLab Spoken Meadow Mari corpus — not a phonetic/IPA target. Conventions are documented at http://lingconlab.ru/spoken_meadow_mari/.

Alphabet: Meadow Mari Cyrillic (36 letters) — а б в г д е ё ж з и й к л м н ҥ о ӧ п р с т у ӱ ф х ц ч ш щ ъ ы ь э ю я. The Meadow-Mari-specific letters are ҥ, ӧ, ӱ. Russian code-switching uses the same Cyrillic letters and introduces no additional alphabet; a single stray Latin x occurs once as a corpus artifact.
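The alphabet statement above can be checked mechanically. The following is a minimal sketch, not part of the dataset tooling: the function name `unexpected_letters` is hypothetical, and the NFC normalization step is an assumption made so that decomposed forms of ё, й, ӧ, and ӱ compare equal to the precomposed letters.

```python
import unicodedata

# The 36 letters listed above, lower case, precomposed (NFC) forms.
MARI_ALPHABET = set(
    unicodedata.normalize("NFC", "абвгдеёжзийклмнҥоӧпрстуӱфхцчшщъыьэюя")
)

def unexpected_letters(text: str) -> list[str]:
    """Return the sorted alphabetic characters in `text` that are not in
    the Meadow Mari alphabet. Punctuation, digits, and whitespace are
    ignored; comparison is case-insensitive."""
    normalized = unicodedata.normalize("NFC", text).lower()
    return sorted(
        {ch for ch in normalized if ch.isalpha() and ch not in MARI_ALPHABET}
    )
```

Running it over every `transcription` value and collecting the results should surface the single stray Latin x mentioned above, along with any character that has slipped in from another script.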

Normalization (orthography re-gather): the IPA-like transcription and the prior silver columns were replaced with the orthographic sentence text from a clone of the LingConLab corpus repository (https://github.com/LingConLab/data_oral_meadow-mari_corpus), using this pseudo-algorithm: (1) collapse the word-level LingConLab CSV to unique sentences keyed by (filename, time_start, time_end, sentence_id) → text; (2) clean each sentence: strip [speaker] tags, [ФИО] anonymization, / unclear markers, = truncation marks, -Ø zero-morphology marks, and ( ) asides, then collapse whitespace; (3) index the cleaned text by its rounded (start, end) interval (collision-free in this corpus); (4) for each asr.csv row, replace transcription with the cleaned text at the row's (start, end) and drop the silver columns; (5) drop rows with no time code, no interval match, or text that is empty after cleaning, and recompute hours on the kept rows. Result: 2,428 of 2,521 rows kept (≈1.29 h); 93 dropped.
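Steps (2) to (5) above can be sketched roughly as follows. This is an illustrative reconstruction, not the script that produced the dataset: the regular expressions are assumptions about how the markers appear in the corpus text, the rounding precision of 3 decimals is a guess based on the sample row, and the names `clean_sentence`, `build_index`, and `regather` are hypothetical.

```python
import re

# Assumed marker syntax; the real corpus conventions may differ.
_BRACKETED = re.compile(r"\[[^\]]*\]")  # [speaker] tags and [ФИО] anonymization
_ASIDE = re.compile(r"\([^)]*\)")       # ( ) asides
_INLINE = re.compile(r"-Ø|[/=]")        # zero-morphology, unclear, truncation marks
_SPACES = re.compile(r"\s+")
SILVER_COLUMNS = ("orthography_silver", "translit_status")

def clean_sentence(text):
    """Step 2: strip annotation markers, then collapse whitespace."""
    text = _BRACKETED.sub(" ", text)
    text = _ASIDE.sub(" ", text)
    text = _INLINE.sub("", text)
    return _SPACES.sub(" ", text).strip()

def build_index(sentences, ndigits=3):
    """Step 3: map the rounded (start, end) interval to cleaned text.
    `sentences` is an iterable of (start, end, raw_text) tuples."""
    return {
        (round(start, ndigits), round(end, ndigits)): clean_sentence(text)
        for start, end, text in sentences
    }

def regather(rows, index, ndigits=3):
    """Steps 4 and 5: replace transcriptions, drop silver columns, and
    separate kept rows from dropped ones (with the reason for dropping)."""
    kept, dropped = [], []
    for row in rows:
        if row.get("start") in (None, "") or row.get("end") in (None, ""):
            dropped.append((row["id"], "untimed"))
            continue
        key = (round(float(row["start"]), ndigits),
               round(float(row["end"]), ndigits))
        if key not in index:
            dropped.append((row["id"], "no_match"))
        elif not index[key]:
            dropped.append((row["id"], "empty"))
        else:
            new_row = {k: v for k, v in row.items() if k not in SILVER_COLUMNS}
            new_row["transcription"] = index[key]
            kept.append(new_row)
    return kept, dropped
```

Keeping empty cleaned strings in the index, rather than discarding them, is what lets the sketch tell "no interval match" apart from "pure annotation", mirroring the three drop reasons reported above.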

Sample row:

id,start,end,source,lang,transcription
5453,1.418,3.522,audio/11257.mp3,mhr,"Уке, мый мом налын онал."
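
A row in this format can be read with Python's standard `csv` module. The sketch below parses the sample row from an in-memory string so that it runs without the dataset; it assumes that `start` and `end` are offsets in seconds into the file named by `source`, which the card does not state explicitly. The helper names `load_segments` and `total_hours` are hypothetical.

```python
import csv
import io

EXPECTED_COLUMNS = ["id", "start", "end", "source", "lang", "transcription"]

def load_segments(fileobj):
    """Read segment rows, check the header, and convert times to float."""
    reader = csv.DictReader(fileobj)
    if reader.fieldnames != EXPECTED_COLUMNS:
        raise ValueError(f"unexpected columns: {reader.fieldnames}")
    segments = []
    for row in reader:
        row["start"] = float(row["start"])  # assumed: seconds
        row["end"] = float(row["end"])      # assumed: seconds
        segments.append(row)
    return segments

def total_hours(segments):
    """Sum segment durations and convert seconds to hours."""
    return sum(s["end"] - s["start"] for s in segments) / 3600.0

# The sample row shown above, parsed from memory.
SAMPLE = (
    "id,start,end,source,lang,transcription\n"
    '5453,1.418,3.522,audio/11257.mp3,mhr,"Уке, мый мом налын онал."\n'
)
segments = load_segments(io.StringIO(SAMPLE))
```

For the real file, opening it with `open("asr.csv", encoding="utf-8", newline="")` and passing the handle to `load_segments` should work the same way; if the seconds assumption holds, `total_hours` over all rows should come out near the 1.29 h stated in the description.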
