Task: OTH
Release Date: 5/20/2026
Format: WEBM, TSV
Size: 32.18 MB
Community-sourced dataset of anonymised mouth-only videos, each with a separate audio file and a transcription, in the format used for training and evaluating lip-reading (visual speech recognition) models. It covers 7 languages from 16 anonymous speakers, recorded in a noisy environment.
Licensing
Creative Commons Attribution Share Alike 4.0 International (CC-BY-SA-4.0)
https://spdx.org/licenses/CC-BY-SA-4.0.html
Restrictions/Special Constraints
N/A
Forbidden Usage
It is forbidden to attempt to determine the identity of speakers. It is forbidden to re-host or re-share this dataset.
Ethical Review
This dataset was crowdsourced at noisy live events. Speakers were presented with a Contribution Form that they had to fill in to give explicit consent to sharing their anonymised data under a CC-BY-SA-4.0 license. They contributed voluntarily and were given the option to share age and gender metadata if they wished.
Intended Use
- Building a lip-reading training corpus for a low-resource language
- Collecting visual speech data for audio-visual ASR experiments
- Creating pronunciation demonstrations where speaker identity should not be visible
The dataset was created using the Mozilla Data Collective tool MoLiÈRe. MoLiÈRe produces an audio-visual speech dataset in a format similar to that used in academic lip-reading research. For each sentence recorded, the archive contains:
- A silent video (120×120 pixels) showing only the mouth region, cropped in real time using face detection.
- A separate audio file of the speaker's voice.
The two files are paired by a shared ID in dataset.tsv.
| Property | Value |
|---|---|
| Total samples | 100 |
| Unique speakers | 16 |
| Languages | 7 |
| Video format | WebM (silent, 120×120 px, mouth-only) |
| Audio format | WebM |

| Language | Samples |
|---|---|
| Catalan (Català) | 62 |
| English | 15 |
| Spanish | 11 |
| Portuguese | 3 |
| Japanese | 3 |
| Malayalam | 3 |
| Telugu | 3 |
Speakers self-reported their age group and gender.
| Age Group | Samples |
|---|---|
| 18–29 | 25 |
| 30–39 | 71 |
| 40–49 | 4 |

| Gender | Samples |
|---|---|
| Female | 24 |
| Male | 76 |
Each row in dataset.tsv contains the following columns:
| Column | Description |
|---|---|
| index | Row number (starts from 1) |
| speaker_id | Pseudonymous speaker identifier |
| id | Unique sample ID (shared prefix of both media files) |
| video_file | Filename of the silent mouth-region video (_video.webm) |
| audio_file | Filename of the audio recording (_audio.webm) |
| transcription | Verbatim text of the sample |
| language | Language of the sample |
| age | Self-reported age group of the speaker |
| gender | Self-reported gender of the speaker |
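As a minimal sketch of how the metadata might be read, the Python snippet below parses tab-separated text with the columns listed above and checks that each row's media filenames start with the sample ID. It assumes that dataset.tsv has a header row using exactly these column names and that fields are unquoted; the function names `load_samples` and `check_pairing` and every value in the example excerpt are hypothetical.

```python
import csv
import io


def load_samples(tsv_text):
    """Parse dataset.tsv content into a list of dicts keyed by column name."""
    # Assumption: the first line is a header row and fields are not quoted.
    reader = csv.DictReader(
        io.StringIO(tsv_text), delimiter="\t", quoting=csv.QUOTE_NONE
    )
    return list(reader)


def check_pairing(rows):
    """Return the IDs of rows whose media filenames do not start with the ID."""
    mismatched = []
    for row in rows:
        sample_id = row["id"]
        video_ok = row["video_file"].startswith(sample_id)
        audio_ok = row["audio_file"].startswith(sample_id)
        if not (video_ok and audio_ok):
            mismatched.append(sample_id)
    return mismatched


# Hypothetical two-row excerpt; real IDs, speaker IDs and values will differ.
EXAMPLE_TSV = (
    "index\tspeaker_id\tid\tvideo_file\taudio_file\t"
    "transcription\tlanguage\tage\tgender\n"
    "1\tspk_a\ts001\ts001_video.webm\ts001_audio.webm\t"
    "Bon dia\tCatalan\t30-39\tMale\n"
    "2\tspk_b\ts002\ts002_video.webm\ts002_audio.webm\t"
    "Good morning\tEnglish\t18-29\tFemale\n"
)

rows = load_samples(EXAMPLE_TSV)
print(len(rows), check_pairing(rows))
```

To run this against the real archive, the text of dataset.tsv could be passed in instead of the excerpt, for example by reading the file with UTF-8 encoding first.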
Because only the mouth region is captured at low resolution (120×120 px), the dataset does not reveal the speaker's face or identity. Speaker IDs are random pseudonyms with no link to real-world identities. Contributors have the right to withdraw their data at any time by contacting support@mozilladatacollective.com and providing their private unique identifier. Mozilla Data Collective securely stores a mapping of private unique identifiers to the pseudonymous speaker IDs used in the dataset, allowing for data withdrawal while maintaining anonymity.
Samples were recorded voluntarily via the MoLiÈRe web interface in a very noisy environment. Contributors selected their language, read prompted sentences, and consented to their anonymised recordings being shared publicly under a CC-BY-SA-4.0 license.
The dataset is small (100 samples) and is not intended as a standalone training set for production lip-reading systems.
Language coverage is uneven: Catalan accounts for 62 of the 100 samples.
A single speaker contributed most of the Catalan samples, limiting speaker diversity for that language.
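One way to quantify these limitations before using the data is to compute, for each language, the number of samples, the number of distinct speakers, and the share of samples contributed by the most frequent speaker. The Python sketch below does this for rows shaped like those in dataset.tsv; the function name `coverage` and the example rows are hypothetical and do not reflect the actual distribution.

```python
from collections import Counter, defaultdict


def coverage(rows):
    """Per language: sample count, speaker count, and top speaker's share."""
    by_language = defaultdict(Counter)
    for row in rows:
        by_language[row["language"]][row["speaker_id"]] += 1

    summary = {}
    for language, speakers in by_language.items():
        total = sum(speakers.values())
        # most_common(1) yields the (speaker_id, count) pair with most samples.
        _, top_count = speakers.most_common(1)[0]
        summary[language] = {
            "samples": total,
            "speakers": len(speakers),
            "top_speaker_share": top_count / total,
        }
    return summary


# Hypothetical rows; only the two columns used here are included.
example_rows = [
    {"language": "Catalan", "speaker_id": "spk_a"},
    {"language": "Catalan", "speaker_id": "spk_a"},
    {"language": "Catalan", "speaker_id": "spk_a"},
    {"language": "Catalan", "speaker_id": "spk_b"},
    {"language": "English", "speaker_id": "spk_c"},
]

for language, stats in coverage(example_rows).items():
    print(language, stats)
```

A high `top_speaker_share` for a language suggests that a model evaluated on it may be measuring fit to one speaker rather than generalisation across speakers.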