Mozilla Data Collective is rebuilding the AI data ecosystem with communities at the centre. Access 470+ high-quality global datasets, built by and for the community in a transparent and ethical way.

Datasets

Community

CV Korean Test 25.0 - Noise-Augmented (SCAI)

A noise-augmented version of the Mozilla Common Voice Korean test split for robust ASR evaluation under realistic acoustic conditions.

License: CC0-1.0

Locale: ko

Task: ASR

Format: MP3, JSONL

Size: 21.01 MB

Collaborative Action For Research & Development (CARD)

IBT Torwali Literature Corpus

A Torwali literature corpus (~233K tokens) covering poetry, folklore, biographies, and cultural texts for linguistic research and NLP development.

License: CC-BY-NC-4.0

Locale: trw

Task: NLP

Format: TXT

Size: 488.12 KB

Institute of African Digital Humanities

Bulu_ALCAM-MultimodalDataset

Bulu ALCAM multimodal dataset: lexical entries and example sentences in Bulu (IPA) with French equivalents, audio recordings, and alignment file.

License: NOODL-1.0

Locale: bum

Task: NLP

Format: MP3, TSV

Size: 31.28 MB

Institute of African Digital Humanities

Hausa-TTS-Dataset

This dataset consists of segmented Hausa speech audio clips paired with text, totalling 5h 25m 38s.

License: NOODL-1.0

Locale: hau

Task: TTS

Format: MP3, TSV

Size: 276.90 MB

MirasAI

Tamil Time Aligned Speech Dataset

5-hour Tamil speech dataset with time-aligned transcripts, designed for ASR, forced alignment, subtitle generation, and speech-language research.

License: CC-BY-NC-SA-4.0

Locale: tam

Task: ASR

Format: OGG, SRT

Size: 37.11 MB

MDC Curators

ViQua² — Visual Question-answering about Quantities

Multimodal evaluation dataset for quantity-based visual question answering.

License: CC-BY-SA-4.0

Locale: en-US

Task: CV

Format: JSON, JPEG

Size: 281.05 MB

Institute of African Digital Humanities

Bamun-TTS-Dataset

This dataset consists of segmented Bamun (Shupamem) speech audio clips paired with text, designed for Text-to-Speech (TTS) applications.

License: NOODL-1.0

Locale: bax

Task: TTS

Format: MP3, TSV

Size: 219.97 MB

GriôTech

Territórios Digitais

Dataset on community-driven responses to disinformation and AI in marginalized territories in Brazil, based on participatory research.

License: CC-BY-4.0

Locale: pt, en

Task: N/A

Format: DOCX, PDF, XLSX

Size: 4.24 MB

Taruen

Chuvash TTS

A ~5-hour speech dataset for Chuvash Text-to-Speech (TTS) research, featuring a single female speaker reading news and digits at a rapid tempo.

License: CC-BY-SA-4.0

Locale: cv

Task: TTS

Format: PARQUET

Size: 854.02 MB

RFERL

RFE/RL Persian News Text Corpus

This dataset is a longitudinal news corpus for the Persian language sourced from Radio Farda from 2001 to 2026. It contains over 350,000 articles (51M tokens).

License: CC-BY-NC-SA-4.0

Locale: fa

Task: NLP

Format: TXT

Size: 307.78 MB

MirasAI

Saraiki 10 Hours TTS Dataset

A 10-hour Saraiki text-to-speech dataset consisting of recorded speech and aligned transcripts, designed for speech synthesis research and development.

License: CC-BY-NC-SA-4.0

Locale: srk

Task: TTS

Format: WEBM, TSV

Size: 584.44 MB

MirasAI

Kannada Time Aligned Speech Corpus

A 5-hour Kannada speech dataset with time-aligned transcriptions, designed for ASR, forced alignment, and speech research.

License: CC-BY-NC-SA-4.0

Locale: kan

Task: ASR

Format: OGG, SRT

Size: 355.77 MB

IT'S EASY TO UPLOAD & CONTROL YOUR DATA

Upload your dataset

Mozilla Data Collective works by allowing you to share your data, retain ownership of it, and control who uses it. You can share openly, using existing licenses, or you can build your own.

JOIN THE MOVEMENT

Join Mozilla Data Collective

Mozilla Data Collective wants to radically reimagine our data as power. We are anti-extractivism, anti-monopoly and deeply, profoundly pro-people. We are a collective of linguists, technologists, activists, researchers and creatives who want AI to be all it promises to be - not all it threatens to be. Here, you can share your datasets on your own terms.

FAQs

Find answers quickly

What is Mozilla Data Collective?

Mozilla Data Collective is a platform in the truest sense. It’s yours to stand on, and to make of what you will. We have roots in two Mozilla projects: Common Voice, a CC0 public dataset to help tech speak your language, and the Data Futures Lab, an experimental space for instigating new approaches to data stewardship challenges. Mozilla Data Collective works by allowing you to share your data, retain ownership of it, and control who uses it.


How does it work?

We partner with organizations and individuals to make their data available through Mozilla Data Collective. You can share openly, using existing licenses like Creative Commons, or you can build your own. You can open up your data for everyone or just for some types of downloaders, set custom constraints, and ask for exchange, compensation or recognition. You can govern it as an individual, a co-operative, a trust or something else. After all, it’s your data. The people who access your datasets are authenticated and bound by legally binding contracts, and we have a number of dataset protection features. If you are interested in hosting data on Mozilla Data Collective, please reach out to us at support@mozilladatacollective.com.


Who is behind Mozilla Data Collective?

We are backed and stewarded by Mozilla Foundation - the non-profit, movement-building, and philanthropy arm of Mozilla.