Datasets

RFE/RL

RFE/RL Ukrainian News Text Corpus

A large Ukrainian and Russian news corpus from Radio Svoboda spanning 1995-2026, with over 504,000 articles and 171M tokens.

License: CC-BY-NC-SA-4.0

Locale: uk, ru

Task: NLP

Format: TXT

Size: 591.97 MB

CLEAR Global

Synthetic Text Corpus for African Language ASR

LLM-generated synthetic text in 10 African languages, with human linguistic quality evaluations.

License: CC-BY-NC-4.0

Locale: bm, ny, ha, kr, luo

Task: NLP

Format: TSV

Size: 746.63 KB

Kaler Kantho

Kaler Kantho Bengali Newspaper Corpus

A 10+ million token Bengali newspaper corpus from Kaler Kantho provided in .docx format for large-scale NLP research and linguistic analysis.

License: CC-BY-NC-4.0

Locale: ben

Task: NLP

Format: DOCX

Size: 33.11 MB

CLEAR Global

Marma Text Corpus

Marma language sentences with original and normalized text forms, supporting language technology development for this Tibeto-Burman language.

License: CC-BY-NC-SA-4.0

Locale: rmz

Task: LM

Format: TSV

Size: 188.92 KB

Prothom Alo

Prothom Alo Bengali Newspaper Corpus

A 10+ million token Bengali newspaper corpus from Prothom Alo provided in .docx format for large-scale NLP research and linguistic analysis.

License: CC-BY-NC-4.0

Locale: ben

Task: NLP

Format: DOCX

Size: 42.36 MB

RFE/RL

RFE/RL Uzbek News Text Corpus

Longitudinal Uzbek news corpus from Radio Ozodlik (2002-2026) with over 166,000 articles and 31M tokens.

License: CC-BY-NC-SA-4.0

Locale: uz

Task: NLP

Format: TXT

Size: 154.21 MB

RFE/RL

RFE/RL Romanian (Romania) News Text Corpus

Longitudinal Romanian news corpus from Europa Liberă România (2013-2026) with over 34,000 articles and 17.6M tokens.

License: CC-BY-NC-SA-4.0

Locale: ro

Task: NLP

Format: TXT

Size: 77.95 MB

Community

Hindi 10 Million Text Corpus

Hindi monolingual corpus of about 10 million tokens, compiled from multiple authors and including literary works and personal articles.

License: CC-BY-ND-4.0

Locale: hin

Task: NLP

Format: DOCX

Size: 21.91 MB

Jamuna Printing and Publishing Ltd.

The Daily Jugantor Bengali Language Corpus

Monolingual Bengali corpus of 10.6 million words from Daily Jugantor, a major Bengali news source.

License: CC-BY-NC-SA-4.0

Locale: ben

Task: NLP

Format: DOCX

Size: 40.49 MB

Community

CV Korean Test 25.0 - Noise-Augmented (SCAI)

A noise-augmented version of the Mozilla Common Voice Korean test split for robust ASR evaluation under realistic acoustic conditions.

License: CC0-1.0

Locale: ko

Task: ASR

Format: MP3, JSONL

Size: 21.01 MB

Collaborative Action For Research & Development (CARD)

IBT Torwali Literature Corpus

A Torwali literature corpus (~233K tokens) covering poetry, folklore, biographies, and cultural texts for linguistic research and NLP development.

License: CC-BY-NC-4.0

Locale: trw

Task: NLP

Format: TXT

Size: 488.12 KB

Institute of African Digital Humanities

Bulu_ALCAM-MultimodalDataset

Bulu ALCAM multimodal dataset: lexical entries and example sentences in Bulu (IPA) with French equivalents, audio recordings, and an alignment file.

License: NOODL-1.0

Locale: bum

Task: NLP

Format: MP3, TSV

Size: 31.28 MB

Institute of African Digital Humanities

Hausa-TTS-Dataset

This dataset consists of segmented Hausa speech audio clips paired with text, totalling 5 h 25 min 38 s.

License: NOODL-1.0

Locale: hau

Task: TTS

Format: MP3, TSV

Size: 276.90 MB

MirasAI

Tamil Time Aligned Speech Dataset

A 5-hour Tamil speech dataset with time-aligned transcripts, designed for ASR, forced alignment, subtitle generation, and speech-language research.

License: CC-BY-NC-SA-4.0

Locale: tam

Task: ASR

Format: OGG, SRT

Size: 37.11 MB

MDC Curators

ViQua² — Visual Question-answering about Quantities

Multimodal evaluation dataset for quantity-based visual question answering.

License: CC-BY-SA-4.0

Locale: en-US

Task: CV

Format: JSON, JPEG

Size: 281.05 MB

Institute of African Digital Humanities

Bamun-TTS-Dataset

This dataset consists of segmented Bamun (Shupamem) speech audio clips paired with text, designed for Text-to-Speech (TTS) applications.

License: NOODL-1.0

Locale: bax

Task: TTS

Format: MP3, TSV

Size: 219.97 MB

GriôTech

Territórios Digitais

Dataset on community-driven responses to disinformation and AI in marginalized territories in Brazil, based on participatory research.

License: CC-BY-4.0

Locale: pt, en

Task: N/A

Format: DOCX, PDF, XLSX

Size: 4.24 MB

Taruen

Chuvash TTS

A ~5-hour speech dataset for Chuvash Text-to-Speech (TTS) research, featuring a single female speaker reading news and digits at a rapid tempo.

License: CC-BY-SA-4.0

Locale: cv

Task: TTS

Format: PARQUET

Size: 854.02 MB

RFE/RL

RFE/RL Persian News Text Corpus

Longitudinal Persian news corpus from Radio Farda (2001-2026) with over 350,000 articles and 51M tokens.

License: CC-BY-NC-SA-4.0

Locale: fa

Task: NLP

Format: TXT

Size: 307.78 MB

MirasAI

Saraiki 10 Hours TTS Dataset

A 10-hour Saraiki text-to-speech dataset consisting of recorded speech and aligned transcripts, designed for speech synthesis research and development.

License: CC-BY-NC-SA-4.0

Locale: srk

Task: TTS

Format: WEBM, TSV

Size: 584.44 MB

MirasAI

Kannada Time Aligned Speech Corpus

A 5-hour Kannada speech dataset with time-aligned transcriptions, designed for ASR, forced alignment, and speech research.

License: CC-BY-NC-SA-4.0

Locale: kan

Task: ASR

Format: OGG, SRT

Size: 355.77 MB

MDC Curators

Sentence translation difficulty in Spanish - BOUQuET

A collection of 1,990 Spanish sentences from the BOUQuET benchmark, annotated with sentence translation difficulty scores.

License: CC-BY-SA-4.0

Locale: es

Task: MT

Format: TSV

Size: 81.48 KB

Institute of African Digital Humanities

Yezoum_ALCAM-MultimodalDataset

This dataset comprises aligned audio and text data in Yezoum with French equivalents.

License: NOODL-1.0

Locale: ewo

Task: NLP

Format: MP3, TSV

Size: 12.81 MB

Common Voice

Common Voice Spontaneous Speech 3.0 - Serian Bidayuh

A collection of spontaneous responses to questions in Serian Bidayuh.

License: CC0-1.0

Locale: sdo

Task: ASR

Format: MP3

Size: 201.26 MB