Datasets

MDC Community Concierge

IsiZulu Second Language Learner Speech Corpus

Gold-standard recordings from isiZulu teachers, plus recordings from L2 learners annotated by isiZulu teachers for phonemic and tonal pronunciation errors.

License: CC-BY-SA-4.0

Locale: zu

Task: CALL

Format: WAV, SQLITE

Size: 5.26 GB

EELLAK - GreekFOSS

Modern Greek Dictionary

This dataset is a structured digital export of the Triantafyllides Modern Greek Dictionary.

License: CC BY-NC-ND 4.0

Locale: gr-GR

Task: NLP

Format: PARQUET

Size: 12.02 MB

EELLAK - GreekFOSS

ERT Press

Structured digital collection of press releases and news articles from the official platform of ERT, Greece's national public broadcaster.

License: CC BY-NC-ND 4.0

Locale: gr-GR

Task: NLP

Format: PARQUET

Size: 32.60 MB

Community

Ladino-Spanish Lexical Resources

Dictionary, word list, and verb conjugation files for Ladino (Judeo-Spanish) and Spanish, compiled for a rule-based translation system.

License: CC-BY-4.0

Locale: lad, spa

Task: MT

Format: TXT

Size: 39.92 KB

Institute of African Digital Humanities

Yoruba-TTS-Dataset

This dataset consists of segmented Yoruba speech audio clips paired with text, designed for Text-to-Speech (TTS) applications.

License: NOODL-1.0

Locale: yor

Task: TTS

Format: MP3, TSV

Size: 319.05 MB

Community

Şalom Ladino Corpus

Text corpus of 176,843 words compiled from 397 Judeo-Espanyol articles from Şalom newspaper.

License: CC-BY-4.0

Locale: lad

Task: LM

Format: TXT

Size: 403.16 KB

Community

Ladino: Una Fraza al Diya

307 Ladino (Judeo-Spanish) language learning sentences with translations in Turkish, English, and Spanish, plus audio recordings and images.

License: CC-BY-4.0

Locale: lad

Task: NLP

Format: OGG, JPEG, TSV

Size: 76.35 MB

Kaltepetlahtol

Imágenes de Señalamientos en México

A collection of annotated images of traffic signs and other road signage in Mexico.

License: CC-BY-SA-4.0

Locale: es

Task: CV

Format: JPEG, JSON

Size: 2.23 GB

CLEAR Global

Kanuri Books Corpus

Randomized sentences from Kanuri-language books by four authors, containing 10,281 sentences and 90,706 words.

License: CC-BY-4.0

Locale: kr

Task: LM

Format: TXT

Size: 545.68 KB

MDC Curators

LibriVox Italian TTS Female Voice

4 hours of sentence-aligned speech and text from "Le avventure di Pinocchio" by Carlo Collodi.

License: CC0-1.0

Locale: it

Task: TTS

Format: MP3, TSV

Size: 61.74 MB

MDC Curators

LibriVox Czech TTS Female Voice

2 hours of sentence-aligned speech and text from "Krysař" by Viktor Dyk.

License: CC0-1.0

Locale: cs

Task: TTS

Format: MP3, TXT, TSV

Size: 178.58 MB

MDC Curators

UK Sort Codes - ASR Evaluation

This dataset consists of 1,000 UK bank sort codes read aloud by a single male speaker of UK English from the Midlands.

License: CC-BY-4.0

Locale: en-GB

Task: ASR

Format: WEBM, TSV

Size: 23.76 MB

Community

otomí-hñähñu TTS Voz Masculina

4.5 hours of read speech aligned with text in Hñähñu-Otomí from the Valle del Mezquital, Hidalgo.

License: CC-BY-SA-4.0

Locale: ote

Task: TTS

Format: MP3, TXT, TSV

Size: 119.54 MB

LyngualLabs

Yoruba-English Code-Switching (YECS) Corpus

A 120-hour corpus of high-quality, naturally produced intra-sentential Yoruba-English code-switched speech for training robust ASR and NLP systems.

License: NOODL-1.0

Locale: yo, en

Task: ASR

Format: WAV, CSV

Size: 9.71 GB

Community

Awal Tamazight Dataset

A compilation of monolingual and parallel Tamazight language datasets created by CIEMEN as part of the Awal community language technology project.

License: CC-BY-4.0

Locale: zgh

Task: LM

Format: TSV, JSON, TXT

Size: 11.57 MB

RFE/RL

RFE/RL Serbian, Bosnian, and Montenegrin (Balkan) News Text Corpus

Longitudinal Balkan news corpus from Radio Slobodna Evropa (2003-2026) with nearly 390,000 articles and over 24M tokens.

License: CC-BY-NC-SA-4.0

Locale: hbs

Task: NLP

Format: TXT

Size: 310.39 MB

RFE/RL

RFE/RL Bulgarian News Text Corpus

Longitudinal Bulgarian news corpus from Radio Svobodna Evropa (2019-2026) with over 26,000 articles and 8.3M tokens.

License: CC-BY-NC-SA-4.0

Locale: bg

Task: NLP

Format: TXT

Size: 49.82 MB

RFE/RL

RFE/RL Azerbaijani News Text Corpus

Longitudinal Azerbaijani & Russian news corpus from Radio Azadlıq (2005-2026) with over 239,000 articles and 37M tokens.

License: CC-BY-NC-SA-4.0

Locale: az, ru

Task: NLP

Format: TXT

Size: 211.65 MB

RFE/RL

RFE/RL Belarusian News Text Corpus

Longitudinal Belarusian news corpus from Radio Svaboda (1997-2026) with nearly 339,000 articles and 134M tokens.

License: CC-BY-NC-SA-4.0

Locale: be

Task: NLP

Format: TXT

Size: 486.55 MB

RFE/RL

RFE/RL Macedonian News Text Corpus

Longitudinal Macedonian news corpus from Radio Slobodna Evropa (2002-2026) with over 204,000 articles and 46M tokens.

License: CC-BY-NC-SA-4.0

Locale: mk

Task: NLP

Format: TXT

Size: 133.95 MB

MDC Curators

LibriVox Croatian TTS Male Voice

4 hours of sentence-aligned speech and text from "Priče iz Davnine" on LibriVox.

License: CC0-1.0

Locale: hr

Task: TTS

Format: MP3, TXT, TSV

Size: 377.60 MB

RFE/RL

RFE/RL Romanian (Moldova) News Text Corpus

Longitudinal Romanian, Russian & English news corpus from Europa Liberă Moldova (2002-2026) with over 244,000 articles and 63M tokens.

License: CC-BY-NC-SA-4.0

Locale: ro, ru, en

Task: NLP

Format: TXT

Size: 311.87 MB

RFE/RL

RFE/RL Tajik News Text Corpus

Longitudinal Tajik & Russian news corpus from Radio Ozodi (2000-2026) with over 166,000 articles and 20M tokens.

License: CC-BY-NC-SA-4.0

Locale: tg, ru

Task: NLP

Format: TXT

Size: 145.27 MB

MirasAI

Punjabi 10 Hours TTS

10-hour Punjabi TTS dataset in Shahmukhi script with paired audio and transcripts, designed for speech synthesis and Punjabi language technology.

License: CC-BY-NC-SA-4.0

Locale: pnb

Task: TTS

Format: WEBM, TSV

Size: 481.96 MB