Multilingual Audio Visual Speech (Lip-Reading) Dataset

Overview

The dataset was created using the Mozilla Data Collective tool MoLiÈRe, which produces audio-visual speech datasets in a format similar to that used in academic lip-reading research. The dataset was crowdsourced at noisy live events. Before recording, speakers filled in a Contribution Form giving explicit consent to share their anonymised data under a CC-BY-SA-4.0 license. Participation was voluntary, and speakers could choose whether to share age and gender metadata. For each sentence recorded, the archive contains:

A silent video (120×120 pixels) showing only the mouth region, cropped in real time using face detection.
A separate audio file of the speaker's voice.

The two files are paired by a shared ID in dataset.tsv.

Dataset Statistics

Property	Value
Total samples	100
Unique speakers	16
Languages	7
Video format	WebM (silent, 120×120 px, mouth-only)
Audio format	WebM

Languages

Language	Samples
Catalan (Català)	62
English	15
Spanish	11
Portuguese	3
Japanese	3
Malayalam	3
Telugu	3

Speaker Demographics

Speakers self-reported their age group and gender. The counts below are per sample: each table sums to the 100 samples, not to the 16 unique speakers.

Age Group	Samples
18–29	25
30–39	71
40–49	4

Gender	Samples
Female	24
Male	76
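
Because a speaker may contribute several samples, a demographic tally differs depending on whether it counts samples or unique speakers. The following is a minimal sketch of both tallies, assuming the rows of dataset.tsv have already been read into dictionaries keyed by the column names listed under Schema. The function names `count_samples` and `count_speakers` are illustrative and not part of MoLiÈRe.

```python
from collections import Counter


def count_samples(rows, column):
    """Tally how many samples carry each value of `column` (e.g. "age")."""
    return Counter(row[column] for row in rows)


def count_speakers(rows, column):
    """Tally how many distinct speakers carry each value of `column`."""
    # Collapse repeated samples from the same speaker into one entry.
    seen = {(row["speaker_id"], row[column]) for row in rows}
    return Counter(value for _, value in seen)


# Toy rows with invented values, only to show the difference.
toy_rows = [
    {"speaker_id": "spk_a", "gender": "Male"},
    {"speaker_id": "spk_a", "gender": "Male"},
    {"speaker_id": "spk_b", "gender": "Female"},
]
per_sample = count_samples(toy_rows, "gender")
per_speaker = count_speakers(toy_rows, "gender")
```

In the toy rows, the speaker with two samples is counted twice by `count_samples` and once by `count_speakers`.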

Schema

Each row in dataset.tsv contains the following columns:

Column	Description
`index`	Row number (starts from 1)
`speaker_id`	Pseudonymous speaker identifier
`id`	Unique sample ID (shared prefix of both media files)
`video_file`	Filename of the silent mouth-region video (ends in `_video.webm`)
`audio_file`	Filename of the audio recording (ends in `_audio.webm`)
`transcription`	Verbatim text of the sample
`language`	Language of the sample
`age`	Self-reported age group of the speaker
`gender`	Self-reported gender of the speaker
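
The schema above can be read with Python's standard `csv` module. The sketch below assumes that dataset.tsv is tab-separated, has a header row with exactly these column names, and contains no quoted fields. The helper names `load_samples` and `is_paired` are hypothetical, and the pairing check relies only on the stated convention that `id` is a shared prefix of both media filenames.

```python
import csv
import io

# Column names as listed in the schema table.
EXPECTED_COLUMNS = [
    "index", "speaker_id", "id", "video_file", "audio_file",
    "transcription", "language", "age", "gender",
]


def load_samples(tsv_text):
    """Parse the text of dataset.tsv into a list of dicts, one per sample."""
    reader = csv.DictReader(
        io.StringIO(tsv_text), delimiter="\t", quoting=csv.QUOTE_NONE
    )
    missing = [c for c in EXPECTED_COLUMNS if c not in (reader.fieldnames or [])]
    if missing:
        raise ValueError(f"dataset.tsv is missing columns: {missing}")
    return list(reader)


def is_paired(row):
    """Check that both media filenames start with the sample's shared ID."""
    return (
        row["video_file"].startswith(row["id"])
        and row["audio_file"].startswith(row["id"])
    )


# Typical use (the path is an assumption about where the archive was unpacked):
# with open("dataset.tsv", encoding="utf-8") as f:
#     rows = load_samples(f.read())
# unpaired = [row["id"] for row in rows if not is_paired(row)]
```

Passing the file contents as a string keeps the parser independent of the archive layout, which is not specified here.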

Privacy

Because only the mouth region is captured at low resolution (120×120 px), the dataset does not reveal the speaker's face or identity. Speaker IDs are random pseudonyms with no link to real-world identities. Contributors have the right to withdraw their data at any time by contacting support@mozilladatacollective.com and providing their private unique identifier. Mozilla Data Collective securely stores a mapping from private unique identifiers to the pseudonymous speaker IDs used in the dataset, which allows data to be withdrawn while maintaining anonymity.

Collection Method

Samples were recorded voluntarily via the MoLiÈRe web interface in a very noisy environment. Contributors selected their language, read prompted sentences, and consented to their anonymised recordings being shared publicly under a CC-BY-SA-4.0 license.

Limitations

The dataset is small (100 samples) and is not intended as a standalone training set for production lip-reading systems.
Language coverage is uneven: Catalan accounts for the majority of samples.
A single speaker contributed most of the Catalan samples, limiting speaker diversity for that language.
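
Speaker concentration per language can be checked directly from the metadata. The sketch below assumes rows are dictionaries with the `language` and `speaker_id` columns from the schema. `top_speaker_share` is an illustrative helper, not part of the dataset tooling, and the toy rows are invented.

```python
from collections import Counter, defaultdict


def top_speaker_share(rows):
    """For each language, return its most frequent speaker and that
    speaker's share of the language's samples."""
    per_language = defaultdict(Counter)
    for row in rows:
        per_language[row["language"]][row["speaker_id"]] += 1

    shares = {}
    for language, counts in per_language.items():
        speaker, n = counts.most_common(1)[0]
        shares[language] = (speaker, n / sum(counts.values()))
    return shares


# Toy rows with invented values: one speaker dominates one language.
toy_rows = [
    {"language": "Catalan", "speaker_id": "spk_a"},
    {"language": "Catalan", "speaker_id": "spk_a"},
    {"language": "Catalan", "speaker_id": "spk_a"},
    {"language": "Catalan", "speaker_id": "spk_b"},
    {"language": "English", "speaker_id": "spk_c"},
]
shares = top_speaker_share(toy_rows)
```

A share close to 1.0 for a language means that nearly all of its samples come from one voice and one mouth, so results on that language may not generalise across speakers.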
