Datasets

Community

CV Korean Test 25.0 - Noise-Augmented (SCAI)

A noise-augmented version of the Mozilla Common Voice Korean test split for robust ASR evaluation under realistic acoustic conditions.

License: CC0-1.0

Locale: ko

Task: ASR

Format: MP3, JSONL

Size: 21.01 MB

Collaborative Action For Research & Development (CARD)

IBT Torwali Literature Corpus

A Torwali literature corpus (~233K tokens) covering poetry, folklore, biographies, and cultural texts for linguistic research and NLP development.

License: CC-BY-NC-4.0

Locale: trw

Task: NLP

Format: TXT

Size: 488.12 KB

Institute of African Digital Humanities

Bulu_ALCAM-MultimodalDataset

Bulu ALCAM multimodal dataset: lexical entries and example sentences in Bulu (IPA) with French equivalents, audio recordings, and an alignment file.

License: NOODL-1.0

Locale: bum

Task: NLP

Format: MP3, TSV

Size: 31.28 MB

Institute of African Digital Humanities

Hausa-TTS-Dataset

This dataset consists of segmented Hausa speech audio clips paired with text, totalling 5 hours, 25 minutes, and 38 seconds.

License: NOODL-1.0

Locale: hau

Task: TTS

Format: MP3, TSV

Size: 276.90 MB

MirasAI

Tamil Time Aligned Speech Dataset

A 5-hour Tamil speech dataset with time-aligned transcripts, designed for ASR, forced alignment, subtitle generation, and speech-language research.

License: CC-BY-NC-SA-4.0

Locale: tam

Task: ASR

Format: OGG, SRT

Size: 37.11 MB

MDC Curators

ViQua² — Visual Question-answering about Quantities

Multimodal evaluation dataset for quantity-based visual question answering.

License: CC-BY-SA-4.0

Locale: en-US

Task: CV

Format: JSON, JPEG

Size: 281.05 MB

Institute of African Digital Humanities

Bamun-TTS-Dataset

This dataset consists of segmented Bamun (Shupamem) speech audio clips paired with text, designed for Text-to-Speech (TTS) applications.

License: NOODL-1.0

Locale: bax

Task: TTS

Format: MP3, TSV

Size: 219.97 MB

GriôTech

Territórios Digitais

Dataset on community-driven responses to disinformation and AI in marginalized territories in Brazil, based on participatory research.

License: CC-BY-4.0

Locale: pt, en

Task: N/A

Format: DOCX, PDF, XLSX

Size: 4.24 MB

Taruen

Chuvash TTS

A ~5-hour speech dataset for Chuvash Text-to-Speech (TTS) research, featuring a single female speaker reading news and digits at a rapid tempo.

License: CC-BY-SA-4.0

Locale: cv

Task: TTS

Format: PARQUET

Size: 854.02 MB

RFERL

RFE/RL Persian News Text Corpus

This dataset is a longitudinal Persian-language news corpus sourced from Radio Farda, covering 2001 to 2026. It contains over 350,000 articles (51M tokens).

License: CC-BY-NC-SA-4.0

Locale: fa

Task: NLP

Format: TXT

Size: 307.78 MB

MirasAI

Saraiki 10 Hours TTS Dataset

A 10-hour Saraiki text-to-speech dataset consisting of recorded speech and aligned transcripts, designed for speech synthesis research and development.

License: CC-BY-NC-SA-4.0

Locale: srk

Task: TTS

Format: WEBM, TSV

Size: 584.44 MB

MirasAI

Kannada Time Aligned Speech Corpus

A 5-hour Kannada speech dataset with time-aligned transcriptions, designed for ASR, forced alignment, and speech research.

License: CC-BY-NC-SA-4.0

Locale: kan

Task: ASR

Format: OGG, SRT

Size: 355.77 MB

MDC Curators

Sentence translation difficulty in Spanish - BOUQuET

A collection of sentences in Spanish from the BOUQuET benchmark (1,990 sentences in total) that have been annotated with sentence translation difficulty scores.

License: CC-BY-SA-4.0

Locale: es

Task: MT

Format: TSV

Size: 81.48 KB

Institute of African Digital Humanities

Yezoum_ALCAM-MultimodalDataset

This dataset comprises aligned audio and text data in Yezoum with French equivalents.

License: NOODL-1.0

Locale: ewo

Task: NLP

Format: MP3, TSV

Size: 12.81 MB

Common Voice

Common Voice Spontaneous Speech 3.0 - Serian Bidayuh

A collection of spontaneous responses to questions in Serian Bidayuh.

License: CC0-1.0

Locale: sdo

Task: ASR

Format: MP3

Size: 201.26 MB

Common Voice

Common Voice Scripted Speech 25.0 - Pashto

A collection of read speech recordings in Pashto.

License: CC0-1.0

Locale: ps

Task: ASR

Format: MP3

Size: 97.81 GB

Common Voice

Common Voice Scripted Speech 25.0 - English

A collection of read speech recordings in English.

License: CC0-1.0

Locale: en

Task: ASR

Format: MP3

Size: 87.84 GB

Common Voice

Common Voice Scripted Speech 25.0 - Catalan

A collection of read speech recordings in Catalan.

License: CC0-1.0

Locale: ca

Task: ASR

Format: MP3

Size: 78.67 GB

Institute of African Digital Humanities

Bamun-French Parallel Corpus 2.0

This dataset is an extended and updated version of the "Bamun-French Parallel Corpus 1.1", a parallel corpus of 4,444 lines in Bamun and French.

License: NOODL-1.0

Locale: bax

Task: MT

Format: TSV

Size: 184.29 KB

Common Voice

Common Voice Scripted Speech 25.0 - Kinyarwanda

A collection of read speech recordings in Kinyarwanda.

License: CC0-1.0

Locale: rw

Task: ASR

Format: MP3

Size: 57.18 GB

Common Voice

Common Voice Scripted Speech 25.0 - French

A collection of read speech recordings in French.

License: CC0-1.0

Locale: fr

Task: ASR

Format: MP3

Size: 28.39 GB

Common Voice

Common Voice Scripted Speech 25.0 - Spanish

A collection of read speech recordings in Spanish.

License: CC0-1.0

Locale: es

Task: ASR

Format: MP3

Size: 48.23 GB

Community

Araina Text Corpus (Occitan Aranese)

A text corpus in the Aranese variety of the Gascon dialect of Occitan.

License: CC0-1.0

Locale: oc

Task: LM

Format: TXT

Size: 22.97 MB

Common Voice

Common Voice Scripted Speech 25.0 - Belarusian

A collection of read speech recordings in Belarusian.

License: CC0-1.0

Locale: be

Task: ASR

Format: MP3

Size: 36.21 GB