Mozilla Data Collective is rebuilding the AI data ecosystem with communities at the centre. Access 470+ high-quality global datasets, built by and for the community in a transparent and ethical way.

Datasets

MirasAI

Rangpuri (অংপুরি Ôṅgpuri) Text Corpus

A 500K token corpus of Rangpuri literature and drama provided in UTF-8 format for linguistic research and NLP development.

License: CC-BY-NC-4.0

Locale: rkt

Task: NLP

Format: TXT

Size: 3.68 MB

MirasAI

Chittagonian (চাটগাঁইয়া, saṭgãia) Text Corpus

A 690K token corpus of raw Chittagonian text covering drama, poetry, and folklore for linguistic research and NLP development.

License: CC-BY-NC-SA-4.0

Locale: ctg

Task: NLP

Format: TXT

Size: 6.59 MB

MDC Community Concierge

Speech Corpus of English Learners from Mexico

A corpus of read speech by learners of English living in Mexico.

License: CC-BY-SA-4.0

Locale: en

Task: ASR

Format: MP3, TSV

Size: 201.79 MB

Kaltepetlahtol

Highland Puebla Nahuatl Spoken Image Descriptions

100 culturally-salient images from the Sierra Norte of Puebla, Mexico, with spoken descriptions in Highland Puebla Nahuatl.

License: CC-BY-NC-SA-4.0

Locale: azz

Task: CV

Format: JPG, WEBM, JSON

Size: 64.31 MB

Universidad Nacional Autónoma de México, UNAM

Archivo GELED: Muestra general de audios del cuicateco

A 3-hour corpus of phonetically transcribed audio from different Cuicatec-speaking communities.

License: CC-BY-NC-SA-4.0

Locale: cux, cut

Task: ASR

Format: WAV, TSV

Size: 819.45 MB

MDC Community Concierge

IsiZulu Second Language Learner Speech Corpus

Gold-standard recordings from isiZulu teachers, plus recordings from L2 learners annotated by isiZulu teachers for phonemic and tonal pronunciation errors.

License: CC-BY-SA-4.0

Locale: zu

Task: CALL

Format: WAV, SQLITE

Size: 5.26 GB

EELLAK - GreekFOSS

Modern Greek Dictionary

This dataset is a structured digital export of the Triantafyllides Modern Greek Dictionary.

License: CC BY-NC-ND 4.0

Locale: gr-GR

Task: NLP

Format: PARQUET

Size: 12.02 MB

EELLAK - GreekFOSS

ERT Press

Structured digital collection of press releases and news articles from the official platform of ERT, Greece's national public broadcaster.

License: CC BY-NC-ND 4.0

Locale: gr-GR

Task: NLP

Format: PARQUET

Size: 32.60 MB

Community

Ladino-Spanish Lexical Resources

Dictionary, word list, and verb conjugation files for Ladino (Judeo-Spanish) and Spanish, compiled for a rule-based translation system.

License: CC-BY-4.0

Locale: lad, spa

Task: MT

Format: TXT

Size: 39.92 KB

Institute of African Digital Humanities

Yoruba-TTS-Dataset

This dataset consists of segmented Yoruba speech audio clips paired with text, designed for Text-to-Speech (TTS) applications.

License: NOODL-1.0

Locale: yor

Task: TTS

Format: MP3, TSV

Size: 319.05 MB

Community

Şalom Ladino Corpus

Text corpus of 176,843 words compiled from 397 Judeo-Espanyol articles from Şalom newspaper.

License: CC-BY-4.0

Locale: lad

Task: LM

Format: TXT

Size: 403.16 KB

Community

Ladino: Una Fraza al Diya

307 Ladino (Judeo-Spanish) language learning sentences with translations in Turkish, English, and Spanish, plus audio recordings and images.

License: CC-BY-4.0

Locale: lad

Task: NLP

Format: OGG, JPEG, TSV

Size: 76.35 MB

IT'S EASY TO UPLOAD & CONTROL YOUR DATA

Upload your dataset

Mozilla Data Collective works by allowing you to share your data, retain ownership of it, and control who uses it. You can share openly, using existing licenses, or you can build your own.

JOIN THE MOVEMENT

Join Mozilla Data Collective

Mozilla Data Collective wants to radically reimagine our data as power. We are anti-extractivism, anti-monopoly and deeply, profoundly pro-people. We are a collective of linguists, technologists, activists, researchers and creatives who want AI to be all it promises to be, not all it threatens to be. Here, you can share your datasets on your own terms.

FAQs

Find answers quickly

What is Mozilla Data Collective?

Mozilla Data Collective is a platform in the truest sense. It’s yours to stand on, and make of it what you will. We have roots in two Mozilla projects: Common Voice, a CC0 public dataset to help tech speak your language, and the Data Futures Lab, an experimental space for instigating new approaches to data stewardship challenges. Mozilla Data Collective works by allowing you to share your data, retain ownership of it, and control who uses it.

How does it work?

We partner with organizations and individuals to make their data available through Mozilla Data Collective. You can share openly, using existing licenses like Creative Commons, or you can build your own. You can open up your data to everyone or only to some types of downloaders, set custom constraints, and ask for exchange, compensation or recognition. You can govern it as an individual, a co-operative, a trust or something else. After all, it’s your data. The people who access your datasets are authenticated and bound by legally binding contracts, and we have a number of dataset protection features. If you are interested in hosting data on Mozilla Data Collective, please reach out to us at support@mozilladatacollective.com.

Who is behind Mozilla Data Collective?

We are backed and stewarded by Mozilla Foundation - the non-profit, movement-building, and philanthropy arm of Mozilla.