Datasets

MDC Curators

LibriVox Czech TTS Female Voice

2 hours of sentence-aligned speech/text from "Krysař" by Viktor Dyk
License: CC0-1.0

Locale: cs

Task: TTS

Format: MP3, TXT, TSV

Size: 178.58 MB

MDC Curators

UK Sort Codes - ASR Evaluation

This dataset consists of 1,000 UK bank sort codes read aloud by a single male speaker of UK English from the Midlands.
License: CC-BY-4.0

Locale: en-GB

Task: ASR

Format: WEBM, TSV

Size: 23.76 MB

Community

otomí-hñähñu TTS Voz Masculina

4.5 hours of read speech aligned with text in Hñähñu (Otomí) from the Valle del Mezquital, Hidalgo.
License: CC-BY-SA-4.0

Locale: ote

Task: TTS

Format: MP3, TXT, TSV

Size: 119.54 MB

LyngualLabs

Yoruba-English Code-Switching (YECS) Corpus

A 120-hour corpus of high-quality, naturally produced intra-sentential Yoruba-English code-switched speech for training robust ASR and NLP systems.
License: NOODL-1.0

Locale: yo, en

Task: ASR

Format: WAV, CSV

Size: 9.71 GB

Community

Awal Tamazight Dataset

A compilation of monolingual and parallel Tamazight language datasets created by CIEMEN as part of the Awal community language technology project.
License: CC-BY-4.0

Locale: zgh

Task: LM

Format: TSV, JSON, TXT

Size: 11.57 MB

RFE/RL

RFE/RL Serbian, Bosnian, and Montenegrin (Balkan) News Text Corpus

Longitudinal Balkan news corpus from Radio Slobodna Evropa (2003-2026) with nearly 390,000 articles and over 24M tokens.
License: CC-BY-NC-SA-4.0

Locale: hbs

Task: NLP

Format: TXT

Size: 310.39 MB

RFE/RL

RFE/RL Bulgarian News Text Corpus

Longitudinal Bulgarian news corpus from Radio Svobodna Evropa (2019-2026) with over 26,000 articles and 8.3M tokens.
License: CC-BY-NC-SA-4.0

Locale: bg

Task: NLP

Format: TXT

Size: 49.82 MB

RFE/RL

RFE/RL Azerbaijani News Text Corpus

Longitudinal Azerbaijani & Russian news corpus from Radio Azadlıq (2005-2026) with over 239,000 articles and 37M tokens.
License: CC-BY-NC-SA-4.0

Locale: az, ru

Task: NLP

Format: TXT

Size: 211.65 MB

RFE/RL

RFE/RL Belarusian News Text Corpus

Longitudinal Belarusian news corpus from Radio Svaboda (1997-2026) with nearly 339,000 articles and 134M tokens.
License: CC-BY-NC-SA-4.0

Locale: be

Task: NLP

Format: TXT

Size: 486.55 MB

RFE/RL

RFE/RL Macedonian News Text Corpus

Longitudinal Macedonian news corpus from Radio Slobodna Evropa (2002-2026) with over 204,000 articles and 46M tokens.
License: CC-BY-NC-SA-4.0

Locale: mk

Task: NLP

Format: TXT

Size: 133.95 MB

MDC Curators

LibriVox Croatian TTS Male Voice

4 hours of sentence-aligned speech/text from "Priče iz Davnine" on LibriVox
License: CC0-1.0

Locale: hr

Task: TTS

Format: MP3, TXT, TSV

Size: 377.60 MB

RFE/RL

RFE/RL Romanian (Moldova) News Text Corpus

Longitudinal Romanian, Russian & English news corpus from Europa Liberă Moldova (2002-2026) with over 244,000 articles and 63M tokens.
License: CC-BY-NC-SA-4.0

Locale: ro, ru, en

Task: NLP

Format: TXT

Size: 311.87 MB

RFE/RL

RFE/RL Tajik News Text Corpus

Longitudinal Tajik & Russian news corpus from Radio Ozodi (2000-2026) with over 166,000 articles and 20M tokens.
License: CC-BY-NC-SA-4.0

Locale: tg, ru

Task: NLP

Format: TXT

Size: 145.27 MB

MirasAI

Punjabi 10 Hours TTS

10-hour Punjabi TTS dataset in Shahmukhi script with paired audio and transcripts, designed for speech synthesis and Punjabi language technology.
License: CC-BY-NC-SA-4.0

Locale: pnb

Task: TTS

Format: WEBM, TSV

Size: 481.96 MB

RFE/RL

RFE/RL Turkmen News Text Corpus

Longitudinal Turkmen & Russian news corpus from Azatlyk Radiosy (2009-2026) with nearly 65,000 articles and 16.5M tokens.
License: CC-BY-NC-SA-4.0

Locale: tk, ru

Task: NLP

Format: TXT

Size: 48.28 MB

RFE/RL

RFE/RL Kyrgyz News Text Corpus

Longitudinal Kyrgyz, Russian & English news corpus from Radio Azattyk (2002-2026) with over 352,000 articles and 79M tokens.
License: CC-BY-NC-SA-4.0

Locale: ky, ru, en

Task: NLP

Format: TXT

Size: 282.41 MB

RFE/RL

RFE/RL Georgian News Text Corpus

Longitudinal Georgian news corpus from Radio Tavisupleba (2001-2026) with over 238,000 articles and 38M tokens.
License: CC-BY-NC-SA-4.0

Locale: ka

Task: NLP

Format: TXT

Size: 257.53 MB

RFE/RL

RFE/RL Kazakh News Text Corpus

Longitudinal Kazakh news corpus from Radio Azattyq (2003-2026) with over 139,000 articles and 35M tokens.
License: CC-BY-NC-SA-4.0

Locale: kk

Task: NLP

Format: TXT

Size: 126.81 MB

RFE/RL

RFE/RL Crimean Tatar News Text Corpus

Longitudinal Crimean Tatar news corpus from Qırım.Aqiqat (2014-2026) with over 32,000 articles and 7.5M tokens.
License: CC-BY-NC-SA-4.0

Locale: crh

Task: NLP

Format: TXT

Size: 18.35 MB

RFE/RL

RFE/RL Chechen News Text Corpus

Longitudinal Chechen news corpus from Radio Marsho (2006-2026) with over 30,000 articles and 8.4M tokens.
License: CC-BY-NC-SA-4.0

Locale: ce

Task: NLP

Format: TXT

Size: 28.29 MB

Institute of African Digital Humanities

Naija-TTS-Dataset

This dataset consists of Nigerian Pidgin English (Naija) audio clips paired with text, designed for TTS applications.
License: NOODL-1.0

Locale: pcm

Task: TTS

Format: MP3, TSV

Size: 324.82 MB

RFE/RL

RFE/RL Hungarian News Text Corpus

Complete historical Hungarian news corpus from Szabad Európa (2020-2025) with over 18,000 articles and 12.4M tokens.
License: CC-BY-NC-SA-4.0

Locale: hu

Task: NLP

Format: TXT

Size: 36.64 MB

RFE/RL

RFE/RL Ukrainian (Crimea) News Text Corpus

Longitudinal Ukrainian news corpus from Krym.Realii (2012-2026) with over 162,000 articles and 54M tokens.
License: CC-BY-NC-SA-4.0

Locale: uk

Task: NLP

Format: TXT

Size: 180.13 MB

RFE/RL

RFE/RL Pashto (Pakistani) News Text Corpus

Longitudinal Pakistani Pashto news corpus from Radio Mashaal (2010-2026) with over 52,000 articles and 14.5M tokens.
License: CC-BY-NC-SA-4.0

Locale: ps

Task: NLP

Format: TXT

Size: 39.26 MB